Using std::collections and unsafe: anything can happen

jbe · May 17, 2022, 9:35am

Continuing the discussion from Avoiding unsafe when using a C library: LMDB on URLO:

I recently got irritated by the documentation of HashSet / HashMap and the like, and the Rust reference regarding "logic errors", and I see a fundamental problem in the documentation regarding "logic errors" and (un)safety.

The documentation of HashSet says:

It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.

The Rust reference says:

Logic errors

Safe code may impose extra logical constraints that can be checked at neither compile-time nor runtime. If a program breaks such a constraint, the behavior may be unspecified but will not result in undefined behavior. This could include panics, incorrect results, aborts, and non-termination. The behavior may also differ between runs, builds, or kinds of build.

For example, implementing both Hash and Eq requires that values considered equal have equal hashes. Another example are data structures like BinaryHeap, BTreeMap, BTreeSet, HashMap and HashSet which describe constraints on the modification of their keys while they are in the data structure. Violating such constraints is not considered unsafe, yet the program is considered erroneous and its behavior unpredictable.

Now I'm not entirely sure how to interpret this, but I would interpret it as follows: Anything can happen in these cases, except those things that require unsafe to happen.

Undefined behavior means anything can happen, but the "not specified" behavior basically means anything can happen except the things that can make anything happen (i.e. a proper/strict subset of all possible behavior).

While I understand the reason to make a difference here, this still makes reasoning about a program practically impossible, and in particular it makes it impossible to soundly use unsafe. Consider the following example:

mod i_am_not_unsafe {
    
    use rand::random;
    use std::collections::HashSet;
    use std::hash::{Hash, Hasher};
    
    struct Wobbly;
    
    impl PartialEq for Wobbly {
        fn eq(&self, _: &Wobbly) -> bool {
            random()
        }
    }
    
    impl Eq for Wobbly {}
    
    impl Hash for Wobbly {
        fn hash<H: Hasher>(&self, state: &mut H) {
            random::<bool>().hash(state)
        }
    }

    pub fn run() {
        let mut s = HashSet::<Wobbly>::new();
        for _ in 0..12 {
            s.insert(Wobbly); // anything but UB can happen here!!
        }
    }
}

mod i_am_innocent {
    pub struct Opaque(Vec<i32>);
    impl Opaque {
        pub fn new() -> Self {
            Opaque(vec![1, 2, 3, 4])
        }
        pub fn second_element(&self) -> i32 {
            // Should be sound, because we know `self.0` has four elements:
            *(unsafe { self.0.get_unchecked(1) })
            // But how can we be really sure? It is unspecified what
            // `s.insert(Wobbly)` can do! It might cause other `Vec`s losing
            // elements, for example (e.g. because `HashSet` and `Vec` could
            // share a thread-local state, which gets corrupted by
            // `s.insert(Wobbly)`).
        }
    }
}

fn main() {
    let opaque = i_am_innocent::Opaque::new();
    i_am_not_unsafe::run(); // anything but UB can happen here!!
    println!("{}", opaque.second_element());
}

(Playground)

I am aware that this is a rather hypothetical problem, and will not pose any issue in practice. Moreover, the Rust reference is non-normative. (Not sure about the documentation of std though.)

Nonetheless, I believe that both the std doc and the Rust reference should specify more precisely which errors may happen when "logic errors" are made in the implementation of Hash and Eq when using the data structures provided by std::collections.

Given the Playground example above, the module i_am_not_unsafe would make reasoning in the module i_am_innocent pretty much impossible, thus formally leading to UB (even if that won't happen in practice).

If instead, the documentation provides a finite list of undesired effects that may happen in case of these "logic errors" regarding the use of HashSet, then module i_am_innocent can be sound. Such finite list might look as follows:

Edited version

It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified ~~(it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior.~~ but will be one of the following:

deadlocks / non-termination

panics

aborts

memory or other resource leaks

retrieval of wrong values / wrong behavior regarding the contents of the data stucture

It is guranteed that no other instances of the same collection type or other types are affected.

The reference should be updated accordingly as well.

For a genesis of my considerations refer to the thread on URLO linked in the top of this post.

steffahn · May 17, 2022, 10:42am

I suppose, this is the key point, and I feel like you’re completely right in saying something like this should be added. An exhaustive list of possible effects is not really necessary, at least for your code example.

Naturally and to be precise, it is of course possible that other instances of the same collection / etc… are affected, but only if the behavior of the Hash or Eq implementations or the Hasher are such that shared or global state is modified.

Regarding an explicit list of possible effects, I would actually agree that that’s – whilst being a separate/different request/idea – also a reasonable thing to ask for; I do believe it’s possible to give a list of possible effects without restricting future algorithm design decisions. I’d also question whether or not memory leaks are supposed to be okay. (All of course excluding effects of the Hash/Eq/Hasher themselves which can cause memory leaks, etc… as long as they do it directly.) AFAIK, memory leaks are usually only considered acceptable behavior if they’re explicitly created, created via (reference) cycles, or a consequence of other memory leaks (e.g. pre-poop-your-pants style); I’m not sure what’s gained by including logic errors into this list.

Of course, any algorithm relying on the reasonable behavior or a logically erraneous HashMap itself can be buggy due to the erraneous HashMap and thus cause any cause of misbehavior, including – if unsafe code is involved – unsoundness / UB. But that’s a different thing.

2e71828 · May 17, 2022, 11:20am

I could imagine an array-based collection losing track of whether or not a slot is occupied due to a logic bug in the comparison function. If this happens, the only safe thing to do is deallocate the slot without calling a destructor.

simonbuchan · May 17, 2022, 11:30am

I'm not sure how this could happen and not be a safety hole?

steffahn · May 17, 2022, 11:30am

This assumes the collection notices that it’s “losing track of” stuff “due to a logic bug”, and hence chooses a safe approach. In particular it must notice the logic bug in the first place.

As far as my imagination goes, logic errors have unpredictable results precisely when algorithms rely on the properties of, say, a comparison function without being able to verify those properties, and without being able to tell when such a logic bug happened, which would then mean that (for soundness) whether or not something needs to be deallocated can’t be tracked using the behavior of the comparison function, because there’s no way for the collection to detect the logic bug. TL;DR, I can’t imagine any (more or less) concrete scenario where a collection can prevent double-frees but can’t prevent memory leaks.

simonbuchan · May 17, 2022, 11:31am

Jinx, you owe me a coke.

jbe · May 17, 2022, 11:42am

I believe that:

If a "logic error" causes behavior that isn't clearly bounded to a certain set of things that may happen, you cannot use unsafe anywhere soundly (because I don't see how you could reason about any preconditions being true).

Edit: Of course, formally "anything but UB" is such a set, but it's not sufficient as shown in the Playground example in my OP.

There might be some crate-level isolation, as it's impossible for one crate to thwart the guarantees of another crate if those crates don't use each other (and if there is no unsound use of unsafe), right?. But if std claims that anything but UB can happen, then this would certainly include effects on std::Vec (unless stated otherwise).

I'm not sure if it's possible to state something like "this only affects the instance of the HashSet and not any other values in your program" without giving a finite list of things that are covered. I feel like not having a finite list makes reasoning about code much more hard (if not impossible).

steffahn · May 17, 2022, 11:51am

I guess my main point is that the “gurantee[…] that no other instances of the same collection type or other types are affected” is the most important point^[1] and the least controversial, compared to the task of compiling a good exhaustive list of things-that-could-happen^TM. As noted I agree that the latter does have good value, too, but the first alone would already be a significant improvement in the documentation, and it should be – as noted – completely uncontraversial. (Provided, it’s slightly rephrased to point out that side-effects of the Eq/Hash/Hasher implementation themselves can influence other instances of collections.)

assuming that all of this language is ultimately informal anyway, so I wouldn’t really let an argument of “as long as you don’t provide a finite list, anything can happen, hence this point doesn’t achieve anything at all” count, even if it might be logically correct ↩︎

2e71828 · May 17, 2022, 11:53am

To be honest, neither can I. But I've (tried to) read enough algorithms papers to know that there are people much cleverer than I coming up with this stuff— Leaking memory to maintain safety in unusual situations feels like a tool that we shouldn't prevent them from using.

jbe · May 17, 2022, 12:14pm

I think whether it's uncontraversial depends on whether it's possible to find a precise wording for "anything related(?) to the HashSet instance can happen". It might be a more difficult task than being more concrete (but not sure really).

P.S.: Maybe it would help to propose a particular wording, so discussing about it is easier. But I feel unable to come up with one (mostly because I feel insecure with terminology in Rust, such as "instance", etc).

P.P.S.: And perhaps, if we find such wording, it is effectively the same as such a "good exhaustive list of things-that-could-happen™, even if the list might not be stated explicitly.

steffahn · May 17, 2022, 12:30pm

Maybe something about absence of and/or effects on “global state” might be enough

Though “shared mutable state” between different HashMap (beyond anything that’s part of keys, values, or hashers), e.g. one created as a clone from the other, should probably be ruled out, too, because that wouldn’t be “global”, but still problematic.

2e71828 · May 17, 2022, 1:12pm

How about something like this:

It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell , RefCell , global state, I/O, or unsafe code.

If such a modification occurs, future behavior of the containing HashMap will be unpredictable and cannot be relied upon, with the following exceptions:

The HashSet will never directly perform any unsound actions.

The HashSet will never access, modify, or return values which have not been inserted into the set.

jbe · May 17, 2022, 1:49pm

I like it. (But you accidentally mixed HashMap and HashSet in the example.)

I have no idea if it is "formally" a 100% precise description of what can happen, but I would assume it's at least on the same level as most other documentation of Rust.

For clarity, a non-exhaustive positive list containing non-termination, panics, aborts and (optionally?) resource leaks could still be added. Something like:

If such a modification occurs, future behavior of the containing HashSet will be unpredictable and cannot be relied upon, with the following exceptions:

The HashSet will never directly perform any unsound actions.

The HashSet will never access, modify, or return values which have not been inserted into the set.

In particular, also panic, aborts, memory leaks, or non-termination may occur.

It ~~should~~ could be clear that the list isn't exhaustive.

kpreid · May 17, 2022, 3:33pm

One way to look at this problem is in terms of causality and encapsulation. The things which make UB uniquely bad are:

UB may cause an effect that is “action at a distance” from the source of it: it can arbitrarily modify any other part of the program state, so the consequences do not follow the laws of causality that programmers normally assume or that language and library documentation promises, such as (especially with Rust's single-ownership rules) a value not observably changing unless modified by its owner, a borrow, or interior mutability.
UB can violate invariants across abstraction boundaries: for example, corrupting the contents of private fields of a struct such that it no longer obeys the invariants maintained by the code to which those fields are visible. Narrowly, we can identify this as “encapsulation”; we could also consider it unwanted causality that bypassed the code that wanted to maintain the invariant.

So, I propose that where HashMap currently says “will not result in undefined behavior”, the spirit of the message is “will not violate causality or encapsulation”. We declare it will not violate causality, meaning that it does not affect what a programmer would consider unrelated data; we declare it will not violate encapsulation, meaning that even the things that it interacts with will not enter states that they cannot have entered merely by the HashMap returning values which are valid by its function signatures.

Hmm, that last clause suggests a more formal condition: The HashMap may do anything that is all of:

not-UB / memory-safe,
implied by its function signatures (return wrong values, panic, loop, etc.), and
does not otherwise affect any state (except by returning results or calling callback functions according to point 2) that would not be dropped when the HashMap is.

(And the usual broad exception for the memory allocator's amount of free or used memory as a piece of shared state.)

2e71828 · May 17, 2022, 4:25pm

There's quite a bit of cognitive dissonance in the statement that "The HashMap's behavior is unspecified, but it will not perform any Unspecified Behavior."

I used the term unsound to get around this problem; other options might be:

"compiler-UB" (to distinguish from library-UB, which we do have here)
"memory/thread safety"
"operations forbidden by the language"

On the other hand, "UB" might be strong enough jargon that this isn't an issue in practice.

cuviper · May 17, 2022, 4:31pm

UB is undefined behavior, which has a different formal meaning than unspecified.

jbe · May 17, 2022, 4:32pm

I'm not an expert, but to my understanding undefined behavior (UB) can cause any program behavior (by definition), while unspecified behavior requires further context and isn't well defined.

kpreid · May 17, 2022, 4:46pm

To clarify: when I wrote “not-UB”, I meant the acronym “undefined behavior” which I consider to have a clear-enough formal definition. The rest of my post was attempting to define a useful boundary around behavior which is not formal undefined behavior, but also is not specified to be anything in particular — useful in that programmers can actually write programs which avoid having problems caused by that mis-behavior.

jbe · May 17, 2022, 5:28pm

I use UB for "undefined behavior", which is the common abbreviation, I believe.

I would claim that such behavior (behavior "which is not UB but not specified to be anything in particular") isn't much helpful. If there's no clear definition on what can happen and what not, then it's UB (by definition). The only way out would be to say "anything but UB", which means "anything but at least one thing can happen", but the practical value of such definition is very limited, I think.

Therefore, in one or the other way, it needs to be specified what exactly is unspecified when Hash and Eq are implemented wrong and used in one of the std::collections that rely on them.

Whether that specification needs to be a finite list or some more abstract description, I'm not sure. But there needs to be at least some sort of clarity, as otherwise it would practically be the same as UB.

kpreid · May 17, 2022, 5:41pm

That's what I meant to propose: I wrote

along with a definition of the phrase, which is a very strong upper bound on what can happen, no matter what algorithm the HashMap (or any other type of interest) uses internally and how it goes wrong.

Topic		Replies	Views
Terminology around unsafe, undefined behaviour, and invariants Unsafe Code Guidelines	39	2877	July 24, 2020
Pre-RFC: Unsafe Types	28	7630	March 25, 2019
Unsafe categories	18	1430	December 14, 2021
Overflow checks and unsafe code Unsafe Code Guidelines	10	1794	March 25, 2019
[Pre-RFC] Another take at clarifying `unsafe` semantics	41	4109	March 25, 2019

Using std::collections and unsafe: anything can happen

Related Topics