I’d be a bit hesitant to do that - the semantics we want are those of `read`, and accidentally short-reading your random data is very bad! In addition, none of the extra functionality (`lines`, etc.) makes any sense in the context of an RNG. Matching `Rng` up to `Read` because a method has a similar signature seems like a false equivalence.
`Read` is just a stream of bytes that can be read. Nothing in `Read` says that the stream must be finite or that it must be text, and in fact in many cases it is not. All the counter-points that you list here apply to any kind of byte stream, not just random data.
Any stream of bytes should implement `Read`, and if that does not fit here, then `Read` is a bad abstraction.
I do agree that text-based methods like `lines` don’t make sense. But they don’t make sense for most other byte streams either. IMO they don’t belong in `Read` at all.
By that same logic one could argue that it should be a subtrait of `Iterator<Item=u8>`; but one can’t expect every `Rng` to implement both with matching output. I believe it is more reasonable that, if somebody really wants to use an arbitrary `Rng` as a `Read`, there could be a wrapper adapter type.
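Such an adapter could be only a few lines. This is just a sketch: the minimal `Rng` trait below is a stand-in (not `rand`’s actual trait), and `RngReader` is an invented name:

```rust
use std::io::{self, Read};

// Minimal stand-in for an RNG trait (not rand's actual definition).
trait Rng {
    fn fill_bytes(&mut self, dest: &mut [u8]);
}

// Adapter: wraps any `Rng` so it can be used where `io::Read` is expected.
// An RNG never runs dry, so `read` always fills the whole buffer.
struct RngReader<R: Rng>(R);

impl<R: Rng> Read for RngReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.0.fill_bytes(buf);
        Ok(buf.len())
    }
}
```

Because the adapter always fills the whole buffer, it also sidesteps the short-read worry mentioned above.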
Well yes, I think iterators and streams are very much related and it would be great if they were unified. But that’s not how it is, and `Read` is very similar to what is proposed here, while `Iterator<Item=u8>` is not. Personally, I wouldn’t even care about a separate trait at all; an RNG can implement `Read`, `Iterator<Item=u8>`, or both.
Okay, there are two quite different uses for RNGs:
- (scientific) simulations and games
- cryptography
For the former, quality of randomness (evenness of distribution and independence of samples) can be important. Performance is sometimes important. Having several well-known, seedable generators is useful (e.g. once I reproduced results from a MATLAB stochastic model using MT19937 in C++, thus proving the new implementation matched the old).
For the latter, I suspect many of you know more than me, but as I understand it, there are essentially two uses: (1) get a very high quality random number, likely using OS support, and (2) get a pretty good random number more quickly using a cryptographically-proven RNG.
There’s another dimension to generators: interaction with threads. If you want a crypto-RN in a multi-threaded app, the `thread_rng` approach works fine (as I understand it). If you want a deterministic simulation using an RNG, it’s hard to use any randomness in threads at all, but it may be possible by creating a “sub-RNG” at specific points, seeded from a master RNG (using a different algorithm to ensure independence of samples from the master RNG), and passing the sub-RNG to a worker thread.
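The sub-RNG pattern might look like the following sketch. The toy xorshift64 generator and the `run_workers` function are illustrative only; a real simulation would use better-studied algorithms, and different ones for the master and sub-generators, as noted above:

```rust
use std::thread;

// Toy xorshift64 generator (illustrative only; seed must be non-zero).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Seed one sub-RNG per worker from the master at a fixed point, then
// hand each sub-RNG to its own thread. All seeding happens before any
// worker runs, so the result is deterministic for a given master seed.
fn run_workers(master_seed: u64, workers: usize) -> Vec<u64> {
    let mut master = XorShift64(master_seed);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let mut sub = XorShift64(master.next_u64());
            thread::spawn(move || sub.next_u64())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Since every sub-RNG is seeded on the main thread before any worker starts, the output is reproducible regardless of thread scheduling.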
With the above in mind, `rand` could potentially be split into three crates, although I’m not sure it’s a good idea:
- a crate containing a pseudo-random-number-generator (RNG) trait, and several implementations (crypto-grade, simulation-grade and otherwise)
- a crate providing something like `thread_rng` which is crypto-grade, using OS-provided numbers and possibly a good RNG-based generator, all as a convenient source of “reasonably secure” random numbers; this might depend on the first crate
- a crate providing a set of distributions and support for generating numbers given an RNG implementing the trait in the first crate
Crypto apps might just use the second crate. Scientific simulators might use the first and third, doing their own initialisations. Games and such might use the second for convenience or do the same as simulators if they need more control or performance.
As a specific case, games with procedural generation are only going to want the first one (seeded PRNG) so that the save file can just store the seed without storing the whole level.
Yeah, games with any kind of reproducibility fall into pretty much the same category as scientific simulations, except quality of RNGs may be a little less important. But they may well still want the third crate (distributions).
Regarding RNGs specifically:
- some algorithms produce 32-bit numbers natively
- some algorithms produce 64-bit numbers natively
- ability to convert from one size to the other is necessary to satisfy a common trait, but forcing unnecessary conversions (e.g. `u64 → [u8] → u64` to satisfy a `fill_bytes` method requirement) is almost certainly going to add unoptimisable overhead
The current `Rng` trait has `next_u32`, `next_u64` and `fill_bytes` methods specifically for this reason, with default implementations of the latter two in terms of the first.
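The shape of those defaults is roughly the following (a sketch, not `rand`’s actual code; details such as the word order of the `u64` construction may differ):

```rust
// Sketch of an RNG trait where the wider methods default to
// composing the native 32-bit output.
trait Rng {
    fn next_u32(&mut self) -> u32;

    // Default: build a u64 from two u32 draws.
    fn next_u64(&mut self) -> u64 {
        let lo = self.next_u32() as u64;
        let hi = self.next_u32() as u64;
        (hi << 32) | lo
    }

    // Default: fill the buffer from successive u32 draws,
    // little-endian, truncating the final draw if needed.
    fn fill_bytes(&mut self, dest: &mut [u8]) {
        for chunk in dest.chunks_mut(4) {
            let bytes = self.next_u32().to_le_bytes();
            chunk.copy_from_slice(&bytes[..chunk.len()]);
        }
    }
}
```

A generator whose native output is 64-bit would instead override `next_u64` and derive the others, avoiding the round-trip conversions mentioned above.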
`Rng` also has `next_f64`; I’m not sure whether there are any native FP algorithms worth using over bit-conversion-from-unsigned-integer (I would guess not, outside of specifically trying to reproduce results from some other implementation).
`Rng` has support for generating weighted booleans (a distribution), selecting one of a set, generating `a-zA-Z0-9` ASCII chars, shuffling a slice, and generating random values of arbitrary type. Potentially useful, but doesn’t belong in the RNG trait in my opinion.
Getting back to `next_u64`, I took a look at this Mersenne Twister implementation (both 32-bit and 64-bit generators), and benchmarked generation. I implemented `next_u32` by simply casting a whole `u64` down to `u32`; at least in my code, attempting to save the remaining bits appears to reduce performance.
I’m not sure how much breakage is considered appropriate at this point; @brson already suggested removing everything from this module and starting fresh; personally I don’t feel that much breakage is necessary, but do see quite a few parts which could be tweaked.
I may put some further thought into this and start submitting PRs against `rand`, but I’d rather get some feedback first.
I do a lot of genetic algorithm stuff, and I’m a strong advocate for reproducibility. Seedable algorithms should guarantee not to change over time, and it would be very useful to have the implementation available in the docs (pseudo-code?) as a reference too.
Random question: why is `XorShiftRng` implemented using `[u32; 4]` and not `&'a [u32; 4]` like all the others? It’s very annoying.
I have refined some of my questions to proposals, and elaborated on others, in a new document. (Is there a good way to handle RFCs for libs?)
In terms of the default ranges of `f32` and `u32` not matching, that’s fine. That’s largely what people expect. Adding something like `f32_any_bits` if you want literally any possible `f32` bit arrangement is better.
I can’t imagine a use-case for that; use `distributions::exponential::Exp` if you want an exponential distribution.
[0, 1) is an arbitrary choice; I suppose you’re right though: the default range differs for each integer type too.
It was mentioned earlier in the topic: fuzzers and other quickcheck style code care about the possibility of making all possible float values. Those specialized libs can just have a lib-specific function among their utilities though.
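Such a lib-specific helper could be as small as this (a hypothetical `f32_any_bits`-style function; the 32 bits would come from whatever `u32` source the fuzzer uses):

```rust
// Hypothetical helper: reinterpret 32 random bits as an f32, so every
// possible value (NaNs, infinities, subnormals, both zeros) can come out.
fn f32_any_bits(bits: u32) -> f32 {
    f32::from_bits(bits)
}
```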
I wholeheartedly agree - cryptographically secure by default is a must.
These two speak to a common misconception that I think it’s important we don’t fall prey to: There is precisely one type of cryptographically secure algorithm. There’s no such thing as a CSPRNG that is “pretty good” or that needs to be reseeded in order to be secure. If it has either of those two properties, then it is not cryptographically secure in the first place.
Here are two fantastic resources on the subject:
- Myths about /dev/urandom
- Recommendations for Randomness in the Operating System - Section 3 of this paper, “Realities of Randomness,” covers a number of common misconceptions (including this one) about cryptographically-secure random number generation.
Why user-space CSPRNGs are dangerous
In section 3.4 of the second paper linked above, entitled “User-Space is a Danger Zone,” the authors lay out a number of points about maintaining CSPRNGs in user space that we should be very careful about. In particular:
- Without very careful implementations, user-space CSPRNGs are fork-unsafe. If a process forks, both the parent and child end up with identical copies of memory, which means that unless extra precautions are taken, the two processes’ CSPRNGs will produce identical output after the fork.
- There are a number of subtle, surprising ways that the state of a user-space CSPRNG can be leaked - for example, the state being written out to swap space (see Shredding Your Garbage: Reducing Data Lifetime Through Secure Deallocation).
- Reseeding may be required on systems where blocking for sufficient entropy isn’t supported (notably, `/dev/urandom` on Linux; the `getrandom` syscall is the recommended mechanism for getting randomness that doesn’t have this limitation, but it is not available on older kernels). In this case, calls made for randomness shortly after boot are insecure for all programs, but for programs that always call out to the kernel for randomness, this problem subsides as soon as the kernel’s entropy pool is properly initialized. On the other hand, a user-space CSPRNG that was seeded before the entropy pool was initialized will continue to be insecure unless it employs reseeding.1
I think that all of these concerns need to be addressed - and more research needs to be done to make sure that there are not more concerns that we should have - before we consider maintaining our own CSPRNGs.
1 It may sound like I’ve just contradicted myself here - didn’t I just say that no CSPRNG should require reseeding? The problem is that kernels which do not support blocking until their entropy pools are initialized (see the second linked paper for details on exactly what this means) provide insecure randomness. Thus, while a CSPRNG that is seeded with a cryptographically-secure seed should never need to be reseeded, a CSPRNG whose seed is acquired before the entropy pool is initialized is an exception because its seed is not cryptographically secure. In other words, the CSPRNG didn’t screw up, the kernel screwed up, and now we’re left to try to patch things over.
To be clear, this is only a problem on systems that do not support blocking calls for entropy until the entropy pool is initialized. The primary culprit here is old versions of Linux - new versions support the `getrandom` syscall, which does block properly.
This is a dubious usage of random number generators, if the fuzzer is supposed to test handling for special values. E.g. `f64` encodes +0 when all 64 bits are zero (and -0 when only the first bit is 1); the chance of this is 1 in 2^64, or approximately double that, 10^-19, for generating any 0 (+ or -); i.e. you could run the fuzzer a billion (10^12) times and still only have a 1-in-ten-million chance of generating 0. And there are other special values you might want to test too: +/- infinity (also 63 fixed bits) and NaN (this one is much more likely, since it only fixes the exponent bits and requires a non-zero fraction part).
Edit: I’m using the old-world billion of 10^12 here not 10^9. Not that it matters much, practically speaking.
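For what it’s worth, the arithmetic above checks out (assuming draws uniform over all 64-bit patterns):

```rust
// Chance that a uniform random 64-bit pattern decodes to +0.0 or -0.0.
fn p_any_zero() -> f64 {
    2.0 / 2f64.powi(64) // ≈ 1.08e-19
}

// Expected number of zeros seen over `runs` draws.
fn expected_zeros(runs: f64) -> f64 {
    p_any_zero() * runs // for runs = 1e12, ≈ 1e-7 (1-in-ten-million)
}
```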
Thanks for the links @joshlf; you provide a very good argument against usage of user-space PRNGs by default.
It might therefore make sense if `thread_rng()` just uses `OsRng` (which, on some platforms, requires initialisation just to work out how to access the OS RNG), and lets programmers who really want a user-space PRNG set one up themselves.
For cryptographic applications, this seems fine (to me; I may be missing something). For anything requiring reproducibility, there is a requirement to handle initialisation of the PRNG yourself anyway. For other non-cryptographic uses (e.g. randomised algorithms and simple games), this story is a little unfortunate, since each random number now needs a system call, unless the programmer goes to the effort of setting up a PRNG. This isn’t really that difficult, however.
Looking through the `rand` PRs, there are a few other features/issues worth looking into:
- Performance of `thread_rng()` — is this important? Possibly when used in randomised algorithms.
- Dependency injection for `thread_rng()` — should there be a standard way for binaries to influence the way libraries get random numbers? Note that this is probably not enough for reproducibility. It also assumes some knowledge of whether any libraries used require computationally-secure random numbers or not — possibly `thread_rng` shouldn’t try to serve both cases.
- Serialisation of PRNGs — apparently some games and procedural world generators want this. Obviously `ReseedingRng` cannot usefully support this, but most PRNGs could. Should this feature be added to the `Rng` trait? Perhaps it should be a separate trait, in which case PRNGs could just implement serde’s `Deserialize`, but as a crate feature (optional dependency)?
- Some implementations of `OsRng` can fail, resulting in a panic. Should OS-RNG functions return a `Result<...>` instead? On the other hand, PRNGs don’t need and don’t want this.
- `no_std` users want some subset of `rand` functionality, but this cannot include floats or (?)
- modern Intel CPUs have an `RDRAND` instruction, but this is not universal and probably needs approval for cryptographic applications
Perhaps cryptographic users should use the `ring` crate instead, and `rand` should be focussed on numeric (non-cryptographic) applications; but this doesn’t play well with randomised algorithms, which want a convenient way of getting a random number, where a user may want to override the source of randomness, and in some uses computationally-secure random numbers may be important. So perhaps `thread_rng` should be more like `log`: allow a user to set a custom RNG (but with a sensible default: probably `OsRng`).
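A `log`-style hook might look roughly like this. The names (`set_rng`, `fill_random`) and the plain-fn-pointer design are invented for the sketch, and the default deliberately fills a constant where a real implementation would call out to the OS:

```rust
use std::sync::OnceLock;

// The installed source fills a buffer with random bytes.
type FillFn = fn(&mut [u8]);

static SOURCE: OnceLock<FillFn> = OnceLock::new();

// Called once by the binary at startup, like `log::set_logger`.
// Fails if a source was already installed.
fn set_rng(f: FillFn) -> Result<(), FillFn> {
    SOURCE.set(f)
}

// Stand-in default; a real default would call the OS (e.g. getrandom).
fn default_fill(dest: &mut [u8]) {
    for b in dest.iter_mut() {
        *b = 4; // placeholder, NOT random
    }
}

// What library code calls: uses the override if one was installed.
fn fill_random(dest: &mut [u8]) {
    (SOURCE.get().copied().unwrap_or(default_fill))(dest)
}
```

As with `log::set_logger`, the binary gets exactly one chance to install a source, and library code never needs to know which source is in use.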
So there is a native floating-point PRNG: dSFMT. In the slides attached there are sometimes quite large speed-ups over conversion from integers, depending on the CPU used.
This suggests there may be benefit in keeping `next_f64` in `Rng`; however, there are two drawbacks:
- a pure-FP generator is going to have to write inefficient conversions for the integer methods
- `no_std` users want to keep all FP operations out of the core trait
I would be tempted to remove `next_f32`/`next_f64` anyway, since I imagine there are very few users who would really care about the performance benefit of native FP generators.