RandomState::new_non_random

Hi folks,

A feature request: I tried to open an issue, and it pointed me here.

Let's add a function:

impl RandomState {
  const fn new_non_random() -> RandomState { ... }
}

It would create a "random state" which is not actually random, just some unspecified constants.

There are reasons to do that:

  • make it possible to make HashMap deterministic, which can be useful in certain situations even at the risk of DoS (for example, when debugging tests: with repeatable iteration order, if a test fails on the fifth iteration, we can break on the third iteration on every run)
  • make HashMap::with_hasher a const fn, which is useful for putting a HashMap in a static variable (note that hashbrown's HashMap constructor is already a const fn)

The issues above can be solved by implementing one's own build-hasher type (and making with_hasher a const fn), but that complicates the client code: the extra type parameter has to be propagated everywhere, and the public API gets worse.

These are trivial to implement obviously. Are there reasons not to?

That risk of DoS feels like a good reason not to do this for RandomState. I don't think it's a good idea for the laziest way to have a global HashMap to be one that's not actually HashDoS-resistant.

Now, having with_hasher be a const fn does seem like a good thing to do. If something like FxHashMap can have a const creation function, great!

2 Likes

An empty HashMap (which is my use case) is perfectly DoS-safe. It's not possible (and not proposed) to put a non-empty HashMap into a static variable anyway.

Also, there are a bazillion use cases where DoS resistance is not an issue (most tools, for example).

It should not be easy to create such less-safe hash maps, but it should be possible. I would suggest marking this function unsafe, but it cannot lead to memory violations, and unsafe is reserved for memory-safety issues.

It is not about laziness. It is about performance (lazy statics are not free) and providing an easy-to-use API. Consider this:

struct InnerMessage {
  some_payload: HashMap<String, String>,
  // ... other fields
}

struct OtherMessage {
  inner_message: Option<InnerMessage>,
}

impl OtherMessage {
  fn inner_message(&self) -> &InnerMessage {
    match &self.inner_message {
      Some(m) => m,
      None => &InnerMessage {
        some_payload: HashMap::with_hasher(RandomState::new_non_random()),
        // ...
      },
    }
  }
}

Without the ability to initialize a HashMap with non-random state, this pattern is not possible; we need either a lazy static or some boilerplate code, which is a burden for users of this API with no real benefit.

1 Like

... although I now see a flaw in this proposal: RandomState::clone() preserves its state, and HashMap::clone does not reset that RandomState.

So a cloned, safe, empty HashMap with non-random state becomes DoS-unsafe when populated with new elements.

RandomState could be made lazily initialized on first insert (much harder to implement while retaining backwards compatibility, but possible), though that would make the hash table somewhat more expensive.

I wouldn't call that a random state; it's more like your own hasher. The type-parameter propagation can easily be solved with a type alias.

The only problem is that with_hasher is not const, but that can easily be solved (in fact, hashbrown::HashMap::with_hasher is already const, so it's only a matter of changing it in the stdlib).

You can put it in a Mutex and populate it at runtime.

You can always use another hasher for that, like fnv, ahash, fxhash, etc.

But it's already possible, just not with a RandomState.

Not really, it's for anything that can cause UB.

1 Like

A type alias is much less convenient for users of the library. Users want HashMap, not an alias to HashMap.

And even if I write my own build hasher, that does not prevent the DoS problem. It just complicates the API without solving anything.

I'd be curious to know if BTreeMap would be a workable alternative to HashMap in this case.

Most performance-critical code needs hashtables, not trees.

I think the answer to that question depends greatly on the use case. B-trees are the heart of many high-performance in-memory databases, and branch predictors have been heavily optimized for B-trees.

Can you perhaps be a little more specific about the circumstances where you want a deterministic HashMap but a BTreeMap is unsuitable? The thing about a deterministic HashMap is that its elements are still not in lexicographic order, which, in the cases where I want determinism, is generally very much a nice-to-have.

2 Likes

My main wish is to be able to return a pointer to an empty HashMap (similarly to how return &Vec::new() is already possible). Just to make the API clearer: some users don't want to deal with a custom type parameter, type aliases, or custom HashMap types.

If I understand correctly, you want to put an empty HashMap into a static variable, without mutability, and this is very performance-sensitive... I don't think I understand correctly. Can you please describe your use case more specifically?

1 Like

The use case is described above, in the comment which starts with "It is not about laziness. It is about performance (lazy static is not free) and providing easy to use API."

I mean, why do you want an empty HashMap without mutability?

2 Likes

To be able to construct a const InnerMessage object, so that a pointer to it can be returned when the actual object is not available, like here:

impl OtherMessage {
  fn inner_message(&self) -> &InnerMessage {
    match &self.inner_message {
      Some(m) => m,
      None => &InnerMessage {
        some_payload: HashMap::with_hasher(RandomState::new_non_random()),
        // ...
      },
    }
  }
}

If you are concerned with performance, have you tried measuring?

You are proposing an alleged micro-optimization for a hyper-specific use case, arguing it will improve performance (to the potential detriment of security, insofar as you yourself claimed "I would suggest marking this function unsafe, but this function does not lead to memory violation, and unsafe is only meant to prevent memory issues"), but conspicuously lacking in this thread is any kind of measurement whatsoever that would validate your performance claims.

I hope I don't come off condescending here, because I'm not trying to be. However, I am a stickler for precise measurements of performance claims, especially ones which invalidate security invariants for speed.

Have you tried measuring:

  • HashMap vs BTreeMap for your described use case
  • Any kind of synthetic benchmark whatsoever for what Rust does presently versus your proposed changes

I don't have time today to dig through the HashMap implementation and figure out how feasible this is, but I wonder if we could get the effect @stepancheg wants by having HashMap::new() defer initialization of more of its internal state until the first insertion. It already doesn't allocate space for the bucket array, why not also avoid grabbing any randomness for the hasher?

3 Likes

Given that the hasher is part of HashMap's API, shared references could force the map to generate and store that randomness. As such, implementing this would require making the deferred initialization thread-safe, at least for that function, which might pessimize the typical cases.

2 Likes

Concretely, and having actually looked at the code, what I was suggesting is that std::collections::hash_map::RandomState would initialize its internal state lazily, on the first call to build_hasher, rather than in RandomState::new(). (For instance, by using std::lazy::OnceCell.) This would involve an extra test on every call to build_hasher, but not a thread-safe one (since one is already not expected to be able to use the same RandomState object from multiple threads concurrently without outer locking).

I'm not sure what you mean by that when RandomState is Sync, and you can also do e.g. this:

let map = HashMap::<String, String>::new();
crossbeam::thread::scope(|s| {
    let a = s.spawn(|_| map.hasher().build_hasher().finish());
    let b = s.spawn(|_| map.hasher().build_hasher().finish());
    let a: u64 = a.join().unwrap();
    let b: u64 = b.join().unwrap();
    // By the documentation the threads obtain identical hashers, so these
    // hashes should match
    assert_eq!(a, b);
}).unwrap();
