Continuing the discussion from Avoiding unsafe when using a C library: LMDB on URLO:
I recently got irritated by the documentation of HashSet
/ HashMap
and the like, and the Rust reference regarding "logic errors", and I see a fundamental problem in the documentation regarding "logic errors" and (un)safety.
The documentation of HashSet
says:
It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the
Hash
trait, or its equality, as determined by theEq
trait, changes while it is in the set. This is normally only possible throughCell
,RefCell
, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.
The Rust reference says:
Safe code may impose extra logical constraints that can be checked at neither compile-time nor runtime. If a program breaks such a constraint, the behavior may be unspecified but will not result in undefined behavior. This could include panics, incorrect results, aborts, and non-termination. The behavior may also differ between runs, builds, or kinds of build.
For example, implementing both
Hash
andEq
requires that values considered equal have equal hashes. Another example are data structures likeBinaryHeap
,BTreeMap
,BTreeSet
,HashMap
andHashSet
which describe constraints on the modification of their keys while they are in the data structure. Violating such constraints is not considered unsafe, yet the program is considered erroneous and its behavior unpredictable.
Now I'm not entirely sure how to interpret this, but I would interpret it as follows: Anything can happen in these cases, except those things that require unsafe
to happen.
Undefined behavior means anything can happen, but the "not specified" behavior basically means anything can happen except the things that can make anything happen (i.e. a proper/strict subset of all possible behavior).
While I understand the reason to make a difference here, this still makes reasoning about a program practically impossible, and in particular it makes it impossible to soundly use unsafe
. Consider the following example:
mod i_am_not_unsafe {
use rand::random;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
struct Wobbly;
impl PartialEq for Wobbly {
fn eq(&self, _: &Wobbly) -> bool {
random()
}
}
impl Eq for Wobbly {}
impl Hash for Wobbly {
fn hash<H: Hasher>(&self, state: &mut H) {
random::<bool>().hash(state)
}
}
pub fn run() {
let mut s = HashSet::<Wobbly>::new();
for _ in 0..12 {
s.insert(Wobbly); // anything but UB can happen here!!
}
}
}
mod i_am_innocent {
pub struct Opaque(Vec<i32>);
impl Opaque {
pub fn new() -> Self {
Opaque(vec![1, 2, 3, 4])
}
pub fn second_element(&self) -> i32 {
// Should be sound, because we know `self.0` has four elements:
*(unsafe { self.0.get_unchecked(1) })
// But how can we be really sure? It is unspecified what
// `s.insert(Wobbly)` can do! It might cause other `Vec`s losing
// elements, for example (e.g. because `HashSet` and `Vec` could
// share a thread-local state, which gets corrupted by
// `s.insert(Wobbly)`).
}
}
}
fn main() {
let opaque = i_am_innocent::Opaque::new();
i_am_not_unsafe::run(); // anything but UB can happen here!!
println!("{}", opaque.second_element());
}
I am aware that this is a rather hypothetical problem, and will not pose any issue in practice. Moreover, the Rust reference is non-normative. (Not sure about the documentation of std
though.)
Nonetheless, I believe that both the std
doc and the Rust reference should specify more precisely which errors may happen when "logic errors" are made in the implementation of Hash
and Eq
when using the data structures provided by std::collections
.
Given the Playground example above, the module i_am_not_unsafe
would make reasoning in the module i_am_innocent
pretty much impossible, thus formally leading to UB (even if that won't happen in practice).
If instead, the documentation provides a finite list of undesired effects that may happen in case of these "logic errors" regarding the use of HashSet
, then module i_am_innocent
can be sound. Such finite list might look as follows:
Edited version
It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the
Hash
trait, or its equality, as determined by theEq
trait, changes while it is in the set. This is normally only possible throughCell
,RefCell
, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified(it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.but will be one of the following:
- deadlocks / non-termination
- panics
- aborts
- memory or other resource leaks
- retrieval of wrong values / wrong behavior regarding the contents of the data stucture
It is guranteed that no other instances of the same collection type or other types are affected.
The reference should be updated accordingly as well.
For a genesis of my considerations refer to the thread on URLO linked in the top of this post.