RandomState::new_non_random

Hi folks,

A feature request: I tried to open an issue, and it pointed me here.

Let's add a function:

impl RandomState {
  const fn new_non_random() -> RandomState { ... }
}

It would create a "random state" which is not actually random, just some unspecified constants.

There are reasons to do that:

  • make it possible to make HashMap deterministic, which can be useful in certain situations even at the risk of DoS (for example, when debugging tests: with repeatable iteration order, if a test fails on the fifth iteration, we can break on the third iteration on every run)
  • make HashMap::with_hasher a const fn, which is useful for putting a HashMap in a static variable (note that hashbrown's HashMap constructor is already a const fn)

The issues above can be solved by implementing one's own build-hasher type (and making with_hasher a const fn), but that complicates the client code: the extra type parameter has to be propagated everywhere, and the public API gets worse.

These are trivial to implement obviously. Are there reasons not to?

That risk of DoS feels like a good reason not to do this for RandomState. I don't think it's a good idea for the laziest way to have a global HashMap to be one that's not actually HashDoS-resistant.

Now, having with_hasher be a const fn does seem like a good thing to do. If something like FxHashMap can have a const creation function, great!

2 Likes

An empty HashMap (which is my use case) is perfectly DoS-safe. It's not possible (and not proposed) to put a non-empty HashMap into a static variable anyway.

Also, there are a bazillion use cases where DoS resistance is not an issue (most tools, for example).

It should not be easy to create such less-safe hash maps, but it should be possible. I would suggest marking this function unsafe, but it cannot lead to memory violations, and unsafe is reserved for memory-safety issues.

It is not about laziness. It is about performance (lazy statics are not free) and providing an easy-to-use API. Consider this:

struct InnerMessage {
  some_payload: HashMap<String, String>,
  // ... other fields
}

struct OtherMessage {
  inner_message: Option<InnerMessage>,
}

impl OtherMessage {
  fn inner_message(&self) -> &InnerMessage {
    match &self.inner_message {
      Some(m) => m,
      None => &InnerMessage {
        some_payload: HashMap::with_hasher(RandomState::new_non_random()),
        // ...
      },
    }
  }
}

Without the ability to initialize a HashMap with non-random state, this pattern is not possible; we need either a lazy static or some boilerplate code, which is a burden for users of this API with no real benefit.

1 Like

... although I now see a flaw in this proposal: RandomState::clone() preserves its state, and HashMap::clone does not reset that RandomState.

So a cloned, safe, empty HashMap with non-random state becomes DoS-unsafe when populated with new elements.

RandomState could be made lazily initialized on first insert (much harder to implement while retaining backwards compatibility, but possible), though that would make the hash table somewhat more expensive.

I wouldn't call that a random state; it's more like your own hasher. The type-parameter propagation can easily be solved with a type alias.

The only problem is that with_hasher is not const, but that can easily be solved (in fact, hashbrown::HashMap::with_hasher is already const, so it's only a matter of changing it in the stdlib).

You can put it in a Mutex and populate it at runtime.

You can always use another hasher for that, like fnv, ahash, fxhash, etc.

But it's already possible, just not with a RandomState.

Not really, it's for anything that can cause UB.

1 Like

A type alias is much less convenient for users of the library. Users want HashMap, not an alias to HashMap.

And even if I write my own build hasher, that does not prevent the DoS problem. It just complicates the API without solving anything.

I'd be curious to know if BTreeMap would be a workable alternative to HashMap in this case.

Most performance-critical code needs hashtables, not trees.

I think the answer to that question depends greatly on the use case. B-trees are the heart of many high-performance in-memory databases, and branch predictors have been heavily optimized for B-trees.

Can you perhaps be a little more specific about the circumstances where you want a deterministic HashMap but a BTreeMap is unsuitable? The thing about a deterministic HashMap is that its elements are still not in lexicographic order, which, in the cases where I want determinism, is generally very much a nice-to-have.

2 Likes

My main wish is to be able to return a pointer to an empty HashMap (similarly to how return &Vec::new() is already possible). Just to make the API clearer: some users don't want to deal with a custom type parameter, type aliases, or custom HashMap types.

If I understand correctly, you want to put an empty HashMap into a static variable, without mutability, and this is very performance-sensitive... I don't think I understand correctly. Can you please describe your use case more specifically?

1 Like

The use case is described above, in the comment which starts with "It is not about laziness. It is about performance (lazy static is not free) and providing easy to use API."

I mean, why do you want an empty HashMap without mutability?

2 Likes

To be able to construct a const InnerMessage object, so that a pointer to it can be returned when the actual object is not available, like here:

impl OtherMessage {
  fn inner_message(&self) -> &InnerMessage {
    match &self.inner_message {
      Some(m) => m,
      None => &InnerMessage {
        some_payload: HashMap::with_hasher(RandomState::new_non_random()),
        // ...
      },
    }
  }
}

If you are concerned with performance, have you tried measuring?

You are proposing an alleged micro-optimization for a hyper-specific use case, arguing it will improve performance (to the potential detriment of security, insofar as you yourself claimed "I would suggest marking this function unsafe, but this function does not lead to memory violation, and unsafe is only meant to prevent memory issues"), but conspicuously lacking in this thread is any kind of measurement whatsoever that would validate your performance claims.

I hope I don't come off condescending here, because I'm not trying to be. However, I am a stickler for precise measurements of performance claims, especially ones which invalidate security invariants for speed.

Have you tried measuring:

  • HashMap vs BTreeMap for your described use case
  • Any kind of synthetic benchmark whatsoever for what Rust does presently versus your proposed changes

I don't have time today to dig through the HashMap implementation and figure out how feasible this is, but I wonder if we could get the effect @stepancheg wants by having HashMap::new() defer initialization of more of its internal state until the first insertion. It already doesn't allocate space for the bucket array, why not also avoid grabbing any randomness for the hasher?

3 Likes

Given that the hasher is part of HashMap's API, shared references could force the map to generate and store that randomness. As such, implementing this would require making the deferred initialization thread-safe, at least for that function, which might pessimize the typical cases.

2 Likes

Concretely, and having actually looked at the code, what I was suggesting is that std::collections::hash_map::RandomState would initialize its internal state lazily, on the first call to build_hasher, rather than in RandomState::new(). (For instance, by using std::lazy::OnceCell.) This would involve an extra test on every call to build_hasher, but not a thread-safe one (since one is already not expected to be able to use the same RandomState object from multiple threads concurrently without outer locking).

I'm not sure what you mean by that when RandomState is Sync, and you can also do e.g. this:

let map = HashMap::<String, String>::new();
crossbeam::thread::scope(|s| {
    let a = s.spawn(|_| map.hasher().build_hasher().finish());
    let b = s.spawn(|_| map.hasher().build_hasher().finish());
    let a: u64 = a.join().unwrap();
    let b: u64 = b.join().unwrap();
    // By the documentation the threads obtain identical hashers, so these
    // hashes should match
    assert_eq!(a, b);
}).unwrap();
