Support turning off hashmap randomness

joonazan · July 26, 2023, 3:10pm

I'm having trouble fuzzing and benchmarking Rust programs because of the random seeding of std::collections::HashMap. The random behaviour makes fuzzing instrumentation see differences when there are none. People using cargo fuzz will just have poor results with no indication of a problem. AFL++ reported the stability of my target to be only 60-70%. Disabling randomization brought the stability to 100.00%

In theory, people can support different hashers in their code. But HashMap is used in many common dependencies as well, so they'd have to override every dependency. Instead, I used -Z build-std and edited the constructor of RandomState to be deterministic. Before that, I tried using cargo's patch feature to override hashbrown but that didn't work.

I'm sure many people would benefit from being able to do this in a clean way.

In std's code hashmaps have their own code for true randomness. Maybe the system clock should also be disabled. Other randomness can probably be dealt with by using cargo's patch feature on the rand crate.

I don't know what the best solution would be but I wanted to write about this since I didn't find a thread anywhere even though this problem makes Rust essentially unfuzzable.

I did find this other thread just now, which is about problems benchmarking. The getrandom solution there is cleaner than what I have but I would never have thought of that even though I read the relevant source code.

8573 · July 27, 2023, 2:05am

It could make sense to file a ticket at https://github.com/rust-fuzz/book/issues to recommend a solution, after (or before) a solution is found here.

bascule · July 30, 2023, 8:26pm

FYI, while this may make sense for fuzzing, for the benchmarking use case it may have unexpected effects on HashMap performance including degrading certain O(n) operations to O(n^2). This is because the HashMap implementation relies on a randomized hasher as a "fix" for "accidentally quadratic" behavior.

8573 · July 30, 2023, 8:40pm

This was done:

github.com/rust-fuzz/book

Warn about hashmap randomization and propose solutions

opened 03:34PM - 27 Jul 23 UTC

joonazan

Almost all Rust code uses `std::collections::HashMap` with the default hasher. T…his makes instrumentation almost useless. For example, one of my targets didn't even manage to produce a valid input after days. After I disabled Hashmap randomization, fuzzing even found a crash in a few hours. ## Ways to solve this My solution: edit the standard library. Cross-platform, but not easy to package. LLVM solution: do not instrument hash maps. It is unclear to me how to do this because [the instructions](https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.instrument_list.md) are C++ specific. Linux solution: define your own `getrandom` function. I also recommend AFL++ first, because it tracks whether the same input produces the same trace and warns about it. For a more wordy explanation, see this post: https://internals.rust-lang.org/t/support-turning-off-hashmap-randomness/19234

joonazan · July 31, 2023, 12:26am

@bascule That is very good to know! So instead of a fixed seed, a pseudorandom seed should be used for deterministic results; especially when benchmarking.

jdahlstrom · July 31, 2023, 6:07pm

By the way, the use of an OS entropy source is AFAIK the only reason Hash(Map|Set) are not in alloc. It would be awesome if whatever mechanism used to turn off nondeterminism would also enable non-std use. Conversely, it would be pretty frustrating if there were a mechanism to make the two collections entirely OS-agnostic but they still weren't available in alloc.

bjorn3 · July 31, 2023, 6:51pm

I think having two mutually exclusive features (liballoc HashMap and HashDOS protected HashMap) in the standard library would be pretty bad. Some libraries will depend on one feature and others on the other feature, splitting the ecosystem.

jdahlstrom · July 31, 2023, 7:22pm

In fact the only reason HashMap is in std is the default generic parameter:

pub struct HashMap<K, V, S = RandomState>
                             ^^^^^^^^^^^

Indeed the culprit is a single call in RandomState::new:

sys::hashmap_random_keys()

Instantiating HashMap with any alloc-compatible hasher makes it itself alloc-compatible, it's merely the existence of the default parameter that binds it in std. This is not an optimal situation. The hasher is made to be swappable (and is, of course, commonly changed to some faster, deterministic version when DDOS is irrelevant) so I don't think it's a big problem that no_std compiled code uses a different instantiation than std.

One way to switch off nondeterminism in the current language is to simply use a type alias, something like:

#[cfg(not(test))]
type HashMap<K, V, S = RandomState> = std::collections::HashMap<K, V, S>;
#[cfg(test)]
type HashMap<K, V, S = MyDeterministicRS> = std::collections::HashMap<K, V, S>;

bjorn3 · July 31, 2023, 8:07pm

You can't implement HashMap::new in libstd if HashMap is in liballoc, but it needs to be for HashMap::new to create a new RandomState as hasher. Making it generic over the hasher will break type inference for basically all code using HashMap::new.

CAD97 · July 31, 2023, 9:17pm

While this is true in the strict sense,

changing the bound from S = RandomState to S: Default would be considered a minor breaking change (though of high impact, likely requiring defaults to affect inference in order to be accepted);
std (and alloc/core) has access to the perma-unstable rustc-specific internal hack to bypass this (#[rustc_allow_incoherent_impl]) which is already used elsewhere in order to pretend to be a single coherence unit, and which could be used to provide alloc::collections::HashMap::<_, _, RandomState>::new from std;
- (primarily used for inherent impls on primitives)
- (but is also used for alloc impls on non-primitive core types, e.g. CStr::to_string_lossy)
- (but I did not find one in std not on a primitive type)
there's hope towards eventually extending this style of coherent "friend" crates beyond its private use for core/alloc/std; and
there's hope towards eventually collapsing the std facade into one (logical) crate (moreso than the current status) via features and the mythical "portability lint."

In short: yes, but hopefully eventually no.

SkiFire13 · August 1, 2023, 7:10am

If I remember well #[rustc_allow_incoherent_impl] doesn't support trait implementations, and there's an impl of FromIterator for HashMap<K, V, RandomState> which would still be a blocker for HashMap in alloc. Also, std implements HashMap by delegating to hashbrown, which depends on alloc, so moving HashMap to alloc would create a cyclic dependency (which I guess could be fixed by copy-pasting hashbrown in alloc?)

afetisov · August 1, 2023, 1:01pm

This would be an awesome change. I had a nasty surprise with HashMap while developing for a certain embedded environment. While in general it has std, the getrandom call specifically is not supported, and caused the device to hang. It took a lot of trial and error to find the issue, particularly since HashMap::new just doesn't look like something capable of nasty surprises, and RandomState isn't something one even thinks about 99,99% of the time.

Imho it's one of the cases where rand not being part of std bites people. Ideally HashMap should have explicitly taken an RNG for the RandomState generation, and the current HashMap::new would be a common convenience method which passes the system RNG.

mjbshaw · August 3, 2023, 4:45pm

I think it would be nice if I could fuzz code without requiring the code to manually pass a fuzzing-friendly RNG.

One option would be to use #[cfg(fuzzing)] in std (to turn determinism on), though this would require shipping a separate std binary built just for fuzzing. And I don't know if afl.rs builds with #[cfg(fuzzing)].

Another option would be to have some global function that returns whether or not determinism should be used. When building for fuzzing rustc could link against a definition that returns true (i.e. deterministic behavior), and when building for normal non-fuzzing purposes it could link against a definition that returns false (i.e. nondeterministic behavior). This shouldn't require rebuilding std or shipping two separate std binaries. And if built with LTO then LLVM should be able to inline the definition and trim the dead code.

newpavlov · August 3, 2023, 5:49pm

There is a vague proposal to integrate getrandom into std and allow to overwrite its implementation similarly to alloc. It would allow to move HashMap into alloc (probably, behind a feature flag?) and would make it possible to use PRNG with fixed seed as a getrandom impl for fuzzing and testing. It's not enough for full reproducibility with multi-threaded code (program results may differ depending on whether thread A or B got to call getrandom first), but since HashMap cashes random seed after first initialization and then simply increments it, it will not be a big issue if you do the first initialization before spawning threads.

joonazan · October 23, 2023, 4:21pm

Overriding getrandom on Linux actually isn't as good as modifying the standard library. It does make each run of a program deterministic but fuzzers want to run multiple experiments in a single program to reduce process spawning overhead. They would need some way to reset the global rng, which sounds like a difficult thing to implement.

Big programs where it is difficult to change every instance of Hashmap will probably be so slow that the classic method of forking the process is fast enough, though.

the8472 · October 23, 2023, 6:07pm

maybe they could do some library loading shenanigans to get separate instances of the standard library and the function-to-fuzz?

system · January 21, 2024, 6:07pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Pre-RFC: bless the `type FastHashMap` pattern	92	3947	March 25, 2019
Stability of hash values documentation	8	5317	March 25, 2019
RandomState::new_non_random libs	19	1720	May 21, 2021
Maybe rust is too stable?	11	2735	January 17, 2020
Unstable hash architecture ideas (deprecated)	13	3326	March 25, 2019

Support turning off hashmap randomness

Related Topics