I'm having trouble fuzzing and benchmarking Rust programs because of the random seeding of std::collections::HashMap. The random behaviour makes fuzzing instrumentation see differences when there are none. People using cargo fuzz will just have poor results with no indication of a problem. AFL++ reported the stability of my target to be only 60-70%. Disabling randomization brought the stability to 100.00%
In theory, people can support different hashers in their code. But HashMap is used in many common dependencies as well, so they'd have to override every dependency. Instead, I used -Z build-std and edited the constructor of RandomState to be deterministic. Before that, I tried using cargo's patch feature to override hashbrown but that didn't work.
I'm sure many people would benefit from being able to do this in a clean way.
In std's code hashmaps have their own code for true randomness. Maybe the system clock should also be disabled. Other randomness can probably be dealt with by using cargo's patch feature on the rand crate.
I don't know what the best solution would be but I wanted to write about this since I didn't find a thread anywhere even though this problem makes Rust essentially unfuzzable.
I did find this other thread just now, which is about problems benchmarking. The getrandom solution there is cleaner than what I have but I would never have thought of that even though I read the relevant source code.
By the way, the use of an OS entropy source is AFAIK the only reason Hash(Map|Set) are not in alloc. It would be awesome if whatever mechanism used to turn off nondeterminism would also enable non-std use. Conversely, it would be pretty frustrating if there were a mechanism to make the two collections entirely OS-agnostic but they still weren't available in alloc.
I think having two mutually exclusive features (liballoc HashMap and HashDOS protected HashMap) in the standard library would be pretty bad. Some libraries will depend on one feature and others on the other feature, splitting the ecosystem.
In fact the only reason HashMap is in std is the default generic parameter:
pub struct HashMap<K, V, S = RandomState>
Indeed the culprit is a single call in RandomState::new:
Instantiating HashMap with any alloc-compatible hasher makes it itself alloc-compatible, it's merely the existence of the default parameter that binds it in std. This is not an optimal situation. The hasher is made to be swappable (and is, of course, commonly changed to some faster, deterministic version when DDOS is irrelevant) so I don't think it's a big problem that no_std compiled code uses a different instantiation than std.
One way to switch off nondeterminism in the current language is to simply use a type alias, something like:
type HashMap<K, V, S = RandomState> = std::collections::HashMap<K, V, S>;
type HashMap<K, V, S = MyDeterministicRS> = std::collections::HashMap<K, V, S>;
You can't implement HashMap::new in libstd if HashMap is in liballoc, but it needs to be for HashMap::new to create a new RandomState as hasher. Making it generic over the hasher will break type inference for basically all code using HashMap::new.
changing the bound from S = RandomState to S: Default would be considered a minor breaking change (though of high impact, likely requiring defaults to affect inference in order to be accepted);
std (and alloc/core) has access to the perma-unstable rustc-specific internal hack to bypass this (#[rustc_allow_incoherent_impl]) which is already used elsewhere in order to pretend to be a single coherence unit, and which could be used to provide alloc::collections::HashMap::<_, _, RandomState>::new from std;
(primarily used for inherent impls on primitives)
(but is also used for alloc impls on non-primitive core types, e.g. CStr::to_string_lossy)
(but I did not find one in std not on a primitive type)
there's hope towards eventually extending this style of coherent "friend" crates beyond its private use for core/alloc/std; and
there's hope towards eventually collapsing the std facade into one (logical) crate (moreso than the current status) via features and the mythical "portability lint."
If I remember well #[rustc_allow_incoherent_impl] doesn't support trait implementations, and there's an impl of FromIterator for HashMap<K, V, RandomState> which would still be a blocker for HashMap in alloc. Also, std implements HashMap by delegating to hashbrown, which depends on alloc, so moving HashMap to alloc would create a cyclic dependency (which I guess could be fixed by copy-pasting hashbrown in alloc?)
This would be an awesome change. I had a nasty surprise with HashMap while developing for a certain embedded environment. While in general it has std, the getrandom call specifically is not supported, and caused the device to hang. It took a lot of trial and error to find the issue, particularly since HashMap::new just doesn't look like something capable of nasty surprises, and RandomState isn't something one even thinks about 99,99% of the time.
Imho it's one of the cases where rand not being part of std bites people. Ideally HashMap should have explicitly taken an RNG for the RandomState generation, and the current HashMap::new would be a common convenience method which passes the system RNG.
I think it would be nice if I could fuzz code without requiring the code to manually pass a fuzzing-friendly RNG.
One option would be to use #[cfg(fuzzing)] in std (to turn determinism on), though this would require shipping a separate std binary built just for fuzzing. And I don't know if afl.rs builds with #[cfg(fuzzing)].
Another option would be to have some global function that returns whether or not determinism should be used. When building for fuzzing rustc could link against a definition that returns true (i.e. deterministic behavior), and when building for normal non-fuzzing purposes it could link against a definition that returns false (i.e. nondeterministic behavior). This shouldn't require rebuilding std or shipping two separate std binaries. And if built with LTO then LLVM should be able to inline the definition and trim the dead code.
There is a vague proposal to integrate getrandom into std and allow to overwrite its implementation similarly to alloc. It would allow to move HashMap into alloc (probably, behind a feature flag?) and would make it possible to use PRNG with fixed seed as a getrandom impl for fuzzing and testing. It's not enough for full reproducibility with multi-threaded code (program results may differ depending on whether thread A or B got to call getrandom first), but since HashMap cashes random seed after first initialization and then simply increments it, it will not be a big issue if you do the first initialization before spawning threads.
Overriding getrandom on Linux actually isn't as good as modifying the standard library. It does make each run of a program deterministic but fuzzers want to run multiple experiments in a single program to reduce process spawning overhead. They would need some way to reset the global rng, which sounds like a difficult thing to implement.
Big programs where it is difficult to change every instance of Hashmap will probably be so slow that the classic method of forking the process is fast enough, though.