I think that’s a good idea, perhaps a marker trait that Hash is “sane”, i.e. ok for HashMap use. Although in my case I’m reproducing old programs in Rust, which have the silly behaviour of allowing these datastructures with float in the key of a hashmap. Given that I’m porting the system from Java though, I should probably coalesce all NaN values in my Eq/Hash implementations, since that’s what Float.equals does as mentioned earlier in the thread.
I don't see any reason that f32
/ f64
can't implement Hash
because of HashMap
, f32: !Eq
so the bounds impl<K: Hash + Eq, V> HashMap<K, V, RandomState>
will still disallow it from being used in a HashMap
. I would call Eq
the marker trait that Hash
is sane (given that Hash
matches PartialEq
in all cases, that's not explicitly stated in the docs but I feel like it should be).
I would expect that will be required though, even though f32
doesn't implement Eq
its Hash
implementation should match its PartialEq
implementation, given x == y
then hash(x) == hash(y)
MUST be true.
I think this is a broken result for f32 +0 and -1 since there's specific API that can be used to check sign and thus that information leaks and can result in hash(x) == hash(y)
but your_func(x) != your_func(y)
which is not what you need for caching.
At least for caching uses transmuting to u32/u64 and hashing that would be fine. So maybe there should just be a way to tell #derive(Hash)
to do that for structs?
That’s not broken, there is absolutely nothing guaranteeing that two different values hash to a different value,
impl Hash for f32 {
fn hash<H>(&self, state: &mut H) where H: Hasher {
state.write_u8(4);
}
}
is a 100% valid implementation of Hash
for f32
.
If a type implements Eq
then the contract for Hash
states that it must satisfy if x == y: hash(x) == hash(y)
, but there are no other constraints on Hash
.
If you care about +0
and -0
behaving differently you must check that yourself after receiving a matching hashed value. In that case you can’t use ==
directly on the f32
value since -0.0 == 0.0
is true, you will have to define your own equality comparison and use that instead.
I don't think there's any remaining use case for hashing that doesn't require that building a colision be an extremely difficult thing to do.
It's actually fully broken. Your HashTable
is now a Vec
and your software can now be victim of denial of service attacks by exploiting that fact. The use for old-school insecure hashes went away when it was demonstrated that non-secure hashes created exploitable software and thus secure and fast hashes were created.
All I'm saying is that it's an extremely useful property that if hash(x) == hash(y)
then it's not possible to do myfunc(x) != myfunc(y)
. As it turns out float values make that hard to do with a single equality operator because for math you want one thing and for testing repeated values you want another thing. Other languages end up with several equality operators because of this.
Within the confines of the current rust implementation this may be unsolvable but it's also not a very common use case. So having a way of telling #derive(Hash)
to transmute to u32 would probably be the best idea to solve this.
The rule is that x == y
implies hash(x) == hash(y)
, but not the other way around. So the hash
function alone tells you nothing about how myfunc
will compare, even just for the identity function.
I'm not saying x != y
implies hash(x) != hash(y)
. That's of course impossible to guarantee mathematically as you don't have infinite bits in your hash. But if you use a cryptographic hash it can be true in practice as causing a collision is practically impossible unless you find a fault in the hash.
Now, even though it's possible to avoid collisions in practice we would be guaranteeing them by setting hash(-0) == hash(+0)
. To me that's something important to avoid because rust explicitely supports .is_sign_negative()
and friends. We'd be setting hash(+0) == hash(-0)
when it's very easy to have myfunc(+0) != myfunc(-0)
. For caching applications of hashing that sucks and is completely avoidable.
You said “not possible” before, and I doubt even your updated “practically impossible.” Rust’s Hash
and Hasher
produce a u64
, which only needs about 5 billion items for a 50% probable birthday attack.
I agree that hash(-0) == hash(+0)
is a trivial collision, but it’s consistent with what std::hash
is designed for. I think this is just not the cryptographic hash you’re looking for…
We're discussing Hash
which doesn't imply 64bits as far as I can tell. Even though that's what Hasher
's .finish()
produces there's nothing stopping you from writing your own Hasher
and getting a 128bit or 256bit value out of it. At that point "practically impossible" means something that will take you more than the age of the universe to do which is only distinct from "not possible" in the mathematical sense and not the engineering one (modulo broken hashes).
rust has done very well to not require a specific hashing implementation with the Hash/Hasher
setup. So the hash being cryptographic or not is not the issue the interface should just work with SHA256 as well. The problem is the specific requirement about the relationship between PartialEq
and Hash
. That's actually rust specific and comes from the fact that there are different useful definitions of equality for floats. There's the definition of "all bits are the same or at least all outputs from safe APIs are the same" and then there's the definition of "for math purposes the values are the same". The second one is the equality f32/f64 currently implements but the first is the one that's actually useful to use together with hashing.
So maybe the real conclusion is that rust actually needs two equality traits to be able to fix this properly. One to do x == y
for use in math and another to do x.indistinguishable(y)
for use in hashing situations?
That would apply to anything which equality doesn't compare every detail. Your argument makes it sound like impl Hash for Rc<T>
should take the strong and weak count into account just because Rc::strong_count()
exists.
use std::rc::Rc;
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
fn hash<T: Hash>(t: &T) -> u64 {
let mut hasher = DefaultHasher::new();
t.hash(&mut hasher);
hasher.finish()
}
fn main() {
let a = Rc::new(1);
let b = Rc::new(1);
let _c = Rc::clone(&b);
println!(" a == b: {}", a == b);
// true
println!(" hash: {} == {}", hash(&a), hash(&b));
// 1742378985846435984 == 1742378985846435984
println!("strong_count: {} != {}", Rc::strong_count(&a), Rc::strong_count(&b));
// 1 != 2
}
Rust already has two equality traits (Eq, PartialEq) we don't want the third one . If you have special requirement on the equality and hash, create a BitwiseDistinguishable<T>
wrapper type.
Since Rc is a wrapper around actual values I don't think anyone would care about that. You're not hashing the Rc, you're probably hashing the value and the counts are just out of band data about how the type is stored.
By the properties that were described about PartialEq
and Hash
then f32/f64 shouldn't implement hash ever, it would be broken.
That's already done and I've used it. Makes for ugly code though:
A better solution for my use case is to have a special derive that takes this into account.
See here for some more discussion related to this:
https://github.com/RustCrypto/hashes/issues/29
We’ve probably covered most of what there is to discuss about this so I’m going to try to do a synthesis post next for future reference.
Attempt at a comprehensive summary
This is an attempt at putting in a single place that can be referenced later what was discussed around the topic of hashing floats.
Base Facts
There are some facts about floats and also about how rust currently is setup that are important to clarify:
- Equality in floats is more complex than in most types because for math reasons you want things like
NAN != NAN
and+0 == -0
to be true. This is why floats in rust implementPartialEq
but notEq
as at least the reflexive property (a == a
) is broken by theNAN
case. - There is a different equality concept that’s generally useful which is that if
a == b
it’s not possible to writemyfunc(a) != myfunc(b)
. This is useful for caching situations and float equality in rust doesn’t guarantee this because+0 == -0
and yet.is_sign_negative()
and friends allow you to differentiate between them. Note that this isn’t a property that’s necessarily exclusive to floats. If someone ever ports rust to an architecture that does integers as sign and magnitude the same+0 == -0
case would exist. - The output of
Hash
cannot be relied upon to be stable. The same version of rust can return different values in different architectures. This is not a property of theHasher
that you’re using but instead of the wayHash
happens to be implemented for the type you’re using (e.g., the current implementation ofHash
for slices of integers returns different values in big and little-endian architectures).
If you want to use floats as HashMap keys
In general floats make for a poor choice as a map key because of the way equality is handled (see facts #1 and #2) so the best answer is just don’t do it. If you have a really good reason for it (e.g., caching the output of a really expensive func(f32)
) here’s what you’d probably need to do:
- Implement a separate equality operation that treats
+0 != -0
andNAN == NAN
. The easiest way to do that is probably to just transmutef32
tou32
andf64
tou64
and compare those but you may want to do something fancier like making all possibleNAN
values be equal. - You can then implement
Hash
by just hashing the underlying bytes after also doing normalization ofNAN
values if you did that for equality.
To implement this in the standard rust library a new equality trait would need to be created so that for math NAN != NAN
and for hashing NAN == NAN
. Since this is a corner case the easiest solution is to just do a wrapper type around floats like the ordered-float crate does, for the few situations where you want it.
If you want to have a content hash of a struct that includes floats
In some situations hashing is used to get a unique identifier for some content calculated from the content itself. For this Hash
is not workable as it doesn’t guarantee the results are stable between architectures (see fact #3). Here your best solution is to use serde’s Serialize
on your struct and then implement a Serializer
that just passes the content onto a cryptographic hash. This can almost surely be done generically and so there will probably be a turn-key solution in the near future (see the discussion here). There are also fancier solutions for this in the works like the objecthash crate.
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.