In case of bad Eq implementations, the concept of whether or not a value is contained in the map/set doesn’t make sense in the first place. So it’s impossible to judge whether or not new elements were invented, since that requires a notion of one value (the element or key you query for) being the same as another value (the elements or keys already present in the collection).
It is a logic error for an item to be modified in such a way that the item’s hash, as determined by the Hash trait, or its equality, as determined by the Eq trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified and will not be undefined behavior, but will be one or more of the following:
deadlocks / non-termination / poor performance
panics
aborts
memory or other resource leaks
inconsistent behavior regarding the contents of the collection
Note that unsafe code must never rely on the consistency of the contents of the collection when proper implementation of Hash and Eq isn't guaranteed, as it would be unsound.
Here is what is going on with your code; it looks perfectly consistent to me:
You have a BTreeMap with one Wobbly key.
You look up another Wobbly key, getting either a Some(()) or None, depending on whether your random Ord for Wobbly returns Equal or not for the two Wobblies.
Then, 30 times, you create additional Wobbly keys and look them up as well, each time getting either Some(()) or None randomly, depending on whether they compare equal or not.
Your unsafe code assumes that either all Wobblies will be found or they will all not be found.
But why? Are you saying that they are all identical keys? Based on what? Just because they are ZSTs? That doesn't mean they are equal.
They are not all identical according to your Ord implementation -- Ord decides randomly whether they are equal or not. So, as expected, some are equal, others are unequal.
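A minimal, self-contained sketch of the situation being described (the Wobbly name comes from this thread; the tiny LCG is an illustrative stand-in for a real RNG so no external crates are needed):

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

thread_local! {
    // Tiny LCG as a stand-in RNG so the sketch needs no external crates.
    static SEED: Cell<u64> = Cell::new(0x2545F4914F6CDD1D);
}

fn coin_flip() -> bool {
    SEED.with(|s| {
        let next = s
            .get()
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        s.set(next);
        (next >> 63) == 1
    })
}

// A zero-sized type with a deliberately broken, randomized Ord.
#[derive(PartialEq, Eq)]
struct Wobbly;

impl PartialOrd for Wobbly {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Wobbly {
    fn cmp(&self, _other: &Self) -> Ordering {
        if coin_flip() { Ordering::Equal } else { Ordering::Less }
    }
}

// Look the "same" key up 30 times; each lookup may hit or miss at random.
fn count_hits(map: &BTreeMap<Wobbly, ()>) -> usize {
    (0..30).filter(|_| map.get(&Wobbly).is_some()).count()
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(Wobbly, ());
    println!("found {} of 30 lookups", count_hits(&map));
}
```

Each run may report a different hit count; no stronger claim than "between 0 and 30" is possible, which is exactly the point.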
The assumptions would be more like "if I insert an item, then it should be there later. If I remove an item, then it shouldn't be there later. If I add more items than I remove, the container is not empty"
Consider a mapping of newtyped cookies into resources in an FFI, where you drop some common resource in that FFI when you drop the last item.
If you insert an item with an inconsistent Ord, then it will be there, but you might not be able to find it. So contains might be false, remove might fail, and another insert might add a duplicate.
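As a concrete illustration of inserting under one ordering and searching under another (a sketch only: the names are made up, and the exact outcome is deliberately unspecified by the docs, so the code asserts nothing beyond what must hold):

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeSet;

thread_local! {
    // Global switch that flips the ordering mid-flight -- the "modified
    // while in the set" logic error, expressed through global state.
    static FLIPPED: Cell<bool> = Cell::new(false);
}

#[derive(PartialEq, Eq)]
struct Key(u32);

impl Ord for Key {
    fn cmp(&self, other: &Self) -> Ordering {
        let ord = self.0.cmp(&other.0);
        if FLIPPED.with(|f| f.get()) { ord.reverse() } else { ord }
    }
}

impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Returns (found, len): whether the re-lookup succeeded, and the set's
// length after re-inserting "the same" key under the flipped order.
fn demo() -> (bool, usize) {
    FLIPPED.with(|f| f.set(false));
    let mut set = BTreeSet::new();
    for i in 1..=3 {
        set.insert(Key(i));
    }
    FLIPPED.with(|f| f.set(true));
    let found = set.contains(&Key(3)); // may now miss an element that is there
    set.insert(Key(3)); // may silently create a duplicate
    (found, set.len())
}

fn main() {
    let (found, len) = demo();
    println!("found: {found}, len: {len}");
    // The only portable claim: nothing was removed. Everything else
    // (missed lookup, duplicate entry) is unspecified collection behavior.
    assert!(len >= 3);
}
```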
Once you insert something into a BTreeMap, it is owned by the map, so you no longer have direct access to that object. Therefore, you can't even call any of these methods with that object as an argument! It's impossible to ask the question "does the set contain this object" because you no longer have that object at hand!
What you can do is to call contains, remove, etc with some other object that you consider equal. E.g. maybe you made a clone ahead of time, or you create another one that looks similar.
But now it is important what "equals" means in this context. And I say that what it means is defined by what Ord says about this!
If Ord::cmp queries place this object in the same position between other elements, then it is equal, and it will be found. If they place it in some other position, then it is not equal by this definition, and it will not be found, but it should not be found, because it's not the same key according to what Ord says!
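To illustrate that "equal" here means exactly "Ord::cmp says Equal", independent of what the values look like structurally, here is a sketch with a consistent case-insensitive ordering (CaseKey is an illustrative name):

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

// Two structurally different values that Ord declares equal. For BTreeMap,
// "equal" means exactly "Ord::cmp returns Equal", so the lookup succeeds.
struct CaseKey(String);

impl Ord for CaseKey {
    fn cmp(&self, other: &Self) -> Ordering {
        // A consistent total order: case-insensitive comparison.
        self.0.to_lowercase().cmp(&other.0.to_lowercase())
    }
}

impl PartialOrd for CaseKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for CaseKey {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for CaseKey {}

fn lookup_demo() -> bool {
    let mut map = BTreeMap::new();
    map.insert(CaseKey("Hello".into()), 1);
    // A different string, but Ord places it in the same position -> found.
    map.contains_key(&CaseKey("HELLO".into()))
}

fn main() {
    assert!(lookup_demo());
    println!("\"HELLO\" found the entry inserted as \"Hello\"");
}
```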
The only remaining question is what happens when Ord gets itself into a contradiction. I have been arguing that BTreeMap doesn't even ask questions that could potentially lead to such a contradiction, so that never happens. @scottmcm wants to allow some such questions for debugging purposes etc, and then there need to be some additional rules about how they are handled.
Edit: Technically you can get a reference to a key owned by the map and call map methods with that. But you can't call insert or remove with such a reference, because that would violate reference rules. You could call contains, but it's a bit pointless, since you've already found it.
I think either way, the collection will behave inconsistently with regard to the (observable) contents of the collection. Thus we should state that "inconsistent behavior regarding the contents" is a possible consequence of a bad implementation of Eq/Hash/Ord. And I would assume that, panics, non-termination, and resource leaks aside (allowing those isn't really a big deal, considering the comparison/hashing implementation is already buggy), that's already enough to cover any reasonable implementation.
I don't think it's reasonable to call any of these collections "consistent" (or "working properly" as said earlier) if the trait implementations are buggy.
Now my question is: Is this wording defined well enough:
I don't think it's defined well enough because I could interpret this in many different ways. For instance, does this mean that tree.len() == tree.len() can fail?
Yes, that can fail, as there is no consistent view of the contents: elements may randomly be contained in the collection, or not. There might even be infinitely many elements inside the collection (so that an iterator is never exhausted), and in the next moment no elements at all. That's what I meant by "inconsistent behavior regarding the contents".
P.S.: I don't want to say this is the most restrictive set of possibilities, but I think it's reasonable to cover all potential implementations (and it gives enough guarantees for programmers to soundly use unsafe in their code and count on no files on the disk being deleted randomly).
Then it becomes super hard to reason about trees for types over which you don't control the Ord implementation, so you don't know whether it satisfies the properties or not (e.g. generic code). I don't think that's a good approach.
Yes, I wouldn't use unsafe code that assumes a collection behaves in a certain way (in regard to its contents) where the type argument of the collection implements Eq/Hash/Ord wrongly.
Same as I can't rely on == working properly in my unsafe code if no collection is involved but the type I compare uses a wrong (but sound) Eq implementation.
Any type is allowed to implement Eq / Hash / Ord wrongly, because those traits are safe. So what you're saying is that you would never use unsafe code with collections of types you don't know about. That seems extremely restrictive.
Yes, not if my unsafe code's soundness depends on a proper implementation of these traits (or on consistent behavior of the collection).
One could imagine a new unsafe trait like TrustedEqHashOrd (akin to TrustedLen) if that's really an issue. But I'm not sure it's really needed. I would probably just refrain from using unsafe code dealing with such data types (or provide my own unsafe marker trait to maintain an overview).
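A sketch of what such a marker trait could look like (TrustedEqHashOrd is hypothetical and not part of std; TrustedLen is the real std analogue for iterators):

```rust
use std::hash::Hash;

/// Hypothetical marker trait -- not in std, named by analogy to TrustedLen.
/// `unsafe impl` is the implementor's promise that Eq, Hash, and Ord are
/// mutually consistent and well-behaved, so unsafe code may rely on
/// collections over this type behaving consistently.
unsafe trait TrustedEqHashOrd: Eq + Hash + Ord {}

// Derived impls over plain data are consistent, so this promise is sound.
#[derive(PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Id(u64);

unsafe impl TrustedEqHashOrd for Id {}

// Unsafe code can then demand the bound instead of trusting arbitrary T.
fn max_of<T: TrustedEqHashOrd>(a: T, b: T) -> T {
    if a < b { b } else { a }
}

fn main() {
    assert_eq!(max_of(Id(1), Id(2)).0, 2);
}
```

Since the trait is unsafe to implement, a wrong `unsafe impl` (rather than a wrong Ord) becomes the soundness bug, which is exactly where the blame should land.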
But whether it relies on that or not is precisely what we're discussing here -- what do collections guarantee, really. If they don't even guarantee tree.len() == tree.len(), then it's going to be super-dangerous to even touch such collections in unsafe code, because it's very easy to make such a mistake and assume that tree.len() == tree.len() is true.
But it doesn't have to be like this -- I don't see a reason why these collections can't guarantee this and other standard properties, regardless of what Ord does, along the lines I have described it in this thread. For instance, if BTreeMap were to never ask redundant questions, all these standard properties would be automatically guaranteed. Even if they do ask redundant questions, it's not that hard to specify how contradictions may be handled in some consistent way.
My original intent was not to allow unsafe code to work with collections over types that have broken Eq implementations; instead, I want to be able to use unsafe soundly at all. So my goals are a bit less ambitious.
I think it's okay if using unsafe is super risky, as long as it's possible to use it at all. I have been stumbling upon issues where I thought my code was sound and then later my code was shown to be pretty much unsound. Unsafe Rust really is nasty (and I think it's okay that it is).
Maybe it can be done; I can't really assess it. But it's a second step that goes beyond what's necessary (and it might limit std with regard to future changes, e.g. if a mistake is made in the specification).
P.S.: Anyway, I find it legitimate to discuss these ideas, even if they go beyond solving the original problem.
To clarify what "inconsistent behavior regarding the contents" means, a non-exhaustive list could be given there:
The behavior resulting from such a logic error is not specified but will be one or more of the following:
deadlocks / non-termination / poor performance
panics
aborts
memory or other resource leaks
inconsistent behavior regarding the contents of the collection, including but not limited to:
items randomly appearing in or disappearing from the collection
iterators never ending and returning items indefinitely
method calls returning different results without changing the collection, such that x.len() != x.len()
…
Note that unsafe code must never rely on the consistency of the contents of the collection when proper implementation of Hash and Eq isn't guaranteed, as it would be unsound.
Yes, but it's also important to realize that the exact queries from the time you first inserted it may differ from the queries when you try to look it up later. Even if you haven't modified the map in between, the insert itself may have rebalanced the map so it looks different the second time. So even if Ord is perfectly deterministic for all possible pairs A and B, you may still end up in different places if it is not a total order.
For instance, let's say you insert A, B, C, D. Ord says A < B < C < D.
Now you look up B', which looks very similar to B. B' was never compared with D before.
Now B' gets compared with D, and Ord tells me: B' > D.
OK fine, so B' is not found. BTreeMap assumes the only total order consistent with what it has been told: A < B < C < D < B'. So Ord has (indirectly) told us that B' > B.
Well, so B' is not found, all is well and good!
If we had started the lookup by comparing B vs B', maybe Ord would have said Equal, but since that never happened, it doesn't matter what would have happened in that counterfactual scenario! From BTreeMap's actual point of view, everything is totally fine and it has no problem maintaining the data structure based on what it has actually been told.
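One way to see that BTreeMap only acts on the questions it actually asks is to count how often Ord is consulted during a lookup (Logged and the counter are illustrative names; the exact count depends on std internals, but it is always far below the number of keys):

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

thread_local! {
    static COMPARISONS: Cell<usize> = Cell::new(0);
}

// A well-behaved Ord that counts how often it is actually consulted.
#[derive(PartialEq, Eq)]
struct Logged(u32);

impl Ord for Logged {
    fn cmp(&self, other: &Self) -> Ordering {
        COMPARISONS.with(|c| c.set(c.get() + 1));
        self.0.cmp(&other.0)
    }
}

impl PartialOrd for Logged {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// How many comparisons does a single lookup in a 1000-key map perform?
fn comparisons_for_one_lookup() -> usize {
    let map: BTreeMap<Logged, ()> = (0..1000).map(|i| (Logged(i), ())).collect();
    COMPARISONS.with(|c| c.set(0));
    assert!(map.contains_key(&Logged(617)));
    COMPARISONS.with(|c| c.get())
}

fn main() {
    let n = comparisons_for_one_lookup();
    println!("one lookup asked Ord {n} questions about 1000 keys");
    // Far fewer than 1000: most pairs are simply never compared, so what
    // Ord *would* have said about them never enters the picture.
    assert!(n < 1000);
}
```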
Yes, as long as you don't have other code that says "B/B' is not there, so do an unsafe thing", using the assumption that Ord was well behaved. We're promising not to make such assumptions in std, and external unsafe code should do the same.
I think we agree, it's just a matter of how to frame it.
Just like UB's most insidious behavior is looking like it works, I think "it works properly" should be a valid outcome of misbehavior. As a thought experiment (but not one that necessarily works for a BTreeMap), consider that two logic errors could cancel each other out and give the "correct" answer in all cases, but for the wrong reasons. This is trivially possible by random() picking logically consistent answers, though showing a case where logically inconsistent intermediate answers result in logically consistent eventual answers is more difficult.