Using std::collections and unsafe: anything can happen

I don't know Rust's foundations well enough to tell whether it's feasible to correctly describe this "side-effect-free" behavior (i.e. "anything but side effects can happen"). It would also require the programmer to understand (and fully grasp) what that means.

It might be easy to do in a purely functional language like Haskell, but I'm not so sure about Rust.

Hence, I previously proposed that maybe it's better to simply provide a finite/exhaustive list of what actually can happen.

However, I also see that this imposes restrictions on std regarding future implementation changes and backward compatibility, which would need to be considered carefully.

P.S.: On the other hand, a complicated explanation of what a "side-effect" is in this context might be error-prone as well.

I think the collections should be way more strict.

In particular, when required invariants are broken, they should guarantee that they behave as a collection with correct invariants that is a subset of the actual collection.

In other words, they may behave as if the collection were empty, appear to not insert an element even when they succeed, miss items on lookup, etc., but they can't exhibit arbitrary behavior like deleting all files on disk when none of the trait implementations can do that.

1 Like

I'm not sure if there is much harm in allowing them to cause non-termination, panics, or aborts. After all, that's what execution of Hash::hash or Eq::eq can do anyway if those implementations are bad (Playground). (Edit: Of course you could say the same about deleting all files on disk :thinking:, but I still think that non-termination/panics/aborts is something different.)

And, by the way, this is what any function in Haskell can always do: return Bottom.

Allowing non-termination, panics, and aborts because of inconsistent implementations would give future versions of std more freedom while not really being an issue for the users of std::collections.

Very good point! According to the current documentation, all files on disk could be deleted. But at least the RAM is safe. :wink:

2 Likes

Do they really need to cause non-termination, panics or aborts?

It seems to me that a straightforward implementation of B-tree maps and hash maps will just sometimes miss items or insert them into the wrong part of the collection if Ord/Hash/Eq are mis-implemented, resulting in the described "behave as a subset" behavior.

I guess it might be necessary to also specify that asymptotic time bounds may be relaxed to linear time in some cases (e.g. a B-tree might fail to stay balanced with an incorrect Ord implementation), though it would be nice to avoid that as well.

And also of course invariants that rely on trait implementations like "this iterator will return items in order (according to Ord)" or "this iterator has no duplicates (according to Eq)" will become meaningless if the trait is mis-implemented (instead in the first case the iterator will return a subset of items in arbitrary order).

1 Like

That means if the HashMap gets confused and leaks its contents, it's not allowed to have modified what it leaked?! That seems not quite right.

I would like to note that forbidding panics on inconsistencies would make it impossible for std to provide a defensive implementation (e.g. when compiling in debug mode).

But even if no overly defensive implementation is used, erroneous behavior of eq, hash, and cmp might be caught incidentally in some cases, and reporting those cases might come at little extra cost.

Also compare how integer overflow is handled in Rust: panicking is allowed but not mandatory. Incorrect results are also allowed (but no UB, and no arbitrary non-UB behavior either, like deleting files on the hard drive[1]).


  1. If we don't consider this as being (practically) equivalent to UB as suggested previously here. ↩︎

2 Likes

Actually I think it's even better to just fully specify the behavior of collections regardless of how the traits are implemented.

Something like this:

  • HashMap behaves like an indexed array of buckets, each holding an array of (key, value) pairs. It works by computing the hash and applying a pure closure (conceptually stored in the HashMap) to obtain a set of bucket numbers, then operates only within those buckets, checking the key for equality against the stored keys in an arbitrary order until one compares equal or all compare non-equal, using Eq in a specific way (e.g. new_key == stored_key). Furthermore, before or after inserts or removes it may decide to rehash, meaning that the hashes of an arbitrary subset of items are recomputed and those items are placed into buckets again; before that, the pure function may be replaced, in such a way that the last computed hash of every item still maps to the bucket it is in (and possibly other buckets).
  • BTreeMap behaves like an array where lookups and inserts are performed by binary search with arbitrary pivots, using Ord and Eq in a specific way to do the search, and iteration is done by iterating the array.

There's a danger in overspecifying the collections, as that may prevent us from choosing different implementations in the future -- like we did going from the old Robin Hood HashMap to the current SwissTable hashbrown implementation.

9 Likes

+1 to this. C++'s unordered_map, in retrospect, promised more than it should have and as a result it can't be as performant as people would like.

1 Like

However, the C++ names are more neutral. The names BTreeMap and BinaryHeap describe a specific implementation; OrderedMap and PriorityQueue would have been better IMO. Especially for BinaryHeap, other implementation strategies are possible (such as B-ary heaps with B > 2) that could be more performant, and then being stuck with the old name will be somewhat awkward.

Just a bit of terms definition:

UB is a term of art. It's a bad name, and basically everyone involved in compilers agrees. It doesn't mean "undefined behavior" as in "behavior which is not defined" anymore. Even in the C standard, UB is defined roughly as "behavior on which this standard places no restrictions," whereas cases which are merely not mentioned fall under unspecified behavior.

Unspecified behavior is the one somewhat close to the informal meaning, as it means something similar to "any valid behavior of the abstract machine," though in practice it is often limited to taking an unspecified choice of a given set of behaviors.

UB, however, is the mathematical contradiction. UB cannot happen, because it is not a thing that exists. If UB "happens", you no longer have a valid execution of the source language, and all restrictions on the program behavior never existed in the first place. This is why UB can time travel.

In a short overly simplified summary, unspecified behavior puts no (or limited) restriction on the execution of the unspecified behavior. UB puts no (and I mean no) restriction on the entire behavior.

Is this what the original C specification authors had in mind when they first listed some behavior as undefined? Some will say yes, some will say no, but this is the status of the current understanding, and how 99% of compiler authors/scholars/philosophers communicate about this.


Now what this issue is about is that std::collections's behavior is in fact unspecified without bound in the face of misbehaving implementations of core safe traits. And because the encapsulation boundary of std::collections is in fact all of std, said unspecified behavior could do arbitrary "safe" modification of other items inside std's encapsulation boundary (via global state or otherwise), which could invalidate any assumptions about correctness that consumer unsafe code makes, for any std item.

The correct solution is to bound the range of unspecified behavior. The minimal and "correct" way to do so is to specify that the unspecified misbehavior of an item is limited to other items/behaviors which are derived-from the misbehaving item. The difficult part is then to actually define which items are derived-from.

My first draft, and I think a reasonable upper bound, would be to say roughly that

The behavior which results from such a logic error is not specified, but will not result in undefined behavior; all safe functions remain safe to call. Misbehavior resulting from a logic error is restricted to just the collection which experienced the logic error, any values holding a borrow of the collection, and any values later produced by misbehaving values. Misbehavior can include (but is not limited to) panics, incorrect results, aborts, memory leaks, and non-termination, but does not include any behavior considered undefined.

This ensures that a HashMap misbehaving does not cause a Vec to misbehave, nor another unrelated HashMap, but allows everything that even potentially observes the misbehaving value to misbehave.

Individual collections can then put further bounds on misbehavior, if so desired.

9 Likes

I disagree. "Behavior which is not defined by this standard" and "behavior on which this standard places no restrictions" are synonyms, they mean the same exact thing. If you don't provide any definition for a behavior of a language construct, you haven't placed any restrictions on the behavior. Once you place some restrictions, those restrictions constitute a definition of the construct.

The C++ standard even mentions both ways of saying this:

undefined behavior

behavior for which this document imposes no requirements

*Note [1]: Undefined behavior may be expected when this document omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data

Not true. For C++ this is very clear from the note quoted above. In the C standard it is also pretty clear. For something to be "unspecified" rather than undefined, the standard must define the set of possible interpretations, from which the implementation picks one. Here is the relevant quote from the C standard:

unspecified behavior

use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance

If some language construct doesn't have a set of possible behaviors specified, how would you know that it's not going to be evaluated as "launch nuclear weapons"? You wouldn't. That is the essence of undefined behavior.

2 Likes

That sounds overly broad and quite disturbing. I would expect that the encapsulation boundary of collections is, as is usual in other crates, the module which defines them as a public item. At the very least that would mean that collections cannot affect other parts of std, but I would not expect them to affect each other either.

Is there a formal statement about encapsulation boundaries anywhere, or is it your conservative assumption?

Well, there could be a pub(crate) something that all the different parts of std use.

There's never going to actually be that, because it would be a horrible implementation strategy, but there's no fundamental rust rule that there couldn't be.

2 Likes

The module is the default privacy barrier, but unless you know the implementation of a crate, the crate is the encapsulation barrier. This is trivially true, because items can be defined as pub(crate), and there could be a pub(crate) static OBSERVED_MISBEHAVIOR: AtomicBool in std which is set whenever misbehavior is observed, and then everything checks that flag and potentially misbehaves if it's set.

Items within a crate can be arbitrarily reliant on each other in knotted and recursive ways. It's only the crate barrier which enforces a directed acyclic graph of dependencies (and the std/alloc/core split even gets to play funny games with this acyclicity).

(Just to add: notes in the C++ standard are explicitly non-normative.)

In any case, where unmentioned states fall isn't core to the definition of UB. UB removes all restrictions retroactively from the past; (unbounded) unspecified behavior only affects the behavior itself, and is still bound to the definitional rules of the abstract machine.

The set of all possible transformations of abstract machine state is an uncountable set (or maybe it's a countable infinity? It depends on how you define abstract machine state) but still is "two or more."

The difference between UB and unspecified behavior in the abstract isn't whether they can result in calling extern fn launch_the_nukes(). The difference is in causality. Unspecified behavior does an unspecified (but defined within the abstract machine's semantics) set of abstract machine operations, and then continues operation of the abstract machine as normal. UB is so undefined at the abstract machine level that it literally goes back in time and retroactively removes any definition/restriction of behavior of the executing program, as well as removing all definition/restrictions going forwards.

I agree that truly unbounded unspecified behavior makes it impossible to do anything relying on prior machine state, as the unspecified behavior could adversarially have put the machine in such a state that your operation causes UB, and there's plenty of very bad things it could do without leading to UB. But even in the face of this, there's still a meaningful difference between the two.

4 Likes
It depends on how you define transformation, at least assuming that the set of abstract machine states is countably infinite (which seems the most reasonable to me). If transformations are allowed to be any mathematical function, then there's an uncountable number of them (more precisely, there's exactly as many functions between two countably infinite sets as there are real numbers). If transformations need to be computable functions, then there's only a countably infinite number of them.

I wouldn't be so sure about the time-traveling aspect of UB you describe. Maybe I'm wrong, but the way you describe it seems hard to specify in an abstract model. Furthermore, the precise sequence of machine model states shouldn't be considered part of the overall program behavior anyway; rather, it's only the overall result (and possibly some bound on the time needed to get there) that matters, i.e. the output of the program. Otherwise optimizations couldn't really do anything at all. Well... instead of "overall result", in an interactive program (which many programs are) I suppose each step from one interaction point to the next matters, considering the output and additional input exchanged in each interaction as program behavior. Now, what exactly an interaction is, is a different question, and I don't know how exactly that would be modeled.

Then some aspects of the "time travel" you describe naturally emerge, since the precise steps of program execution don't matter; so once a program will inevitably reach UB, the steps towards that UB don't matter anymore, and optimizations are allowed to make "UB happen earlier" because that doesn't change the overall program behavior. Nonetheless, I'd say that UB is simply one possible machine state you can reach and never get out of again. Or it's the absence of any machine state, in case machine semantics are partial functions; these should be mathematically equivalent approaches.

I'm not sure how UB and interactive programs work in cases where UB is almost inevitable, except that there's some sort of potentially-blocking or program-terminating outside-world interaction. I'm imagining e.g. a 10s sleep, or an interaction with stdin or stdout. When the program waits 10s before reaching UB, and I promise to always kill the process after it has run for 9s or less, it should arguably never reach the UB, shouldn't it?

1 Like

If a program, through some act of angelic premonition, knew that execution is 100% going to enter the UB state, then that UB can time travel backwards across observable states. There truly is zero definition of any behavior of a program execution that encounters UB at any point.

In practice, however, all observable actions of the Abstract Machine (roughly: syscalls, extern fn, volatile read/writes) have a potential side effect of pausing execution of the abstract machine. This means that there is no possible way to reorder observable effects of the Abstract Machine (that are ordered according to the Abstract Machine), and that UB that occurs after an observable behavior cannot corrupt it or earlier observable behavior without actually violating causality.

(Talking about the state of the Abstract Machine is interesting, because you could be talking about internal state, or observable state (i.e. the digraph of observable behavior).)

And to be clear, I'm describing the result of UB. The formal definition of UB is trivial in practice, as it is just Undefined in the mathematical sense of "it doesn't happen." There isn't any formal model of what it means for UB to happen, because it's definitionally outside of the model.

Trying to describe the "result" of UB is fraught with danger, because it necessarily is mixing abstraction layers.

Anyway, I think this is far enough off topic for today :sweat_smile:

1 Like

The "time travel" talk comes from optimizers, at least historically, doing things like treating a statement if A { B }, where B has undefined behavior, as proof that A is false, and thus treating any code path that leads to A as unreachable and removable - that's undefined behavior changing the behavior of code that runs before it.

There's some great examples of this sort of thing here: Undefined behavior: what happened to my code?

3 Likes

I should have named the topic:

Using std::collections and unsafe: Can time-travel happen?

:laughing:

But seriously, I find it interesting whether undefined behavior (UB) and unspecified behavior are the same if you don't put any bounds on it (edit: except that maybe unspecified behavior could be defined not to include UB by definition), and I believe it depends on the overall model you use for reasoning about your program.

I'm sure in some models, UB can time-travel, but in other models (those more close to what really happens in the machine) it can't. (Because I guess time-travel isn't possible in the real world yet, is it?)

If we want to write secure code, it doesn't matter whether UB and unspecified behavior are the same or different, assuming that what the unspecified behavior includes is not specified, except perhaps being limited to affecting std or std::collections; see @CAD97's post:

This is pretty much what I wanted to say, but phrased in a nicer way.


I would agree, though I have to say I'm not an expert on these theoretic issues.

Regarding whether that set must be "finite" or "infinite", I think I made the mistake of demanding a finite list of what can happen. Actually, it doesn't need to be a finite set in terms of math. But it needs to be a set (whether infinite or not) which is big enough to give std freedom in how to implement the collections, and small enough for a programmer to be able to soundly use unsafe in their code – and to be able, in practice and not just in theory, to reason about their code being sound.

Currently, this weighing of opposing goals has been made 100% against the programmer and against program safety[1], and in favor of future changes in std::collections. I understand the rationale behind this (it leaves room for manoeuvring to further develop the language and standard library), but at some point this needs to be revised.


  1. except being formally not unsafe, which doesn't help though ↩︎

I believe that in the current implementation of BTreeMap (and really in any not-too-contrived implementation), regardless of how you implement Ord, there is actually always a linear order between elements (those in the map, plus any key being queried during lookups) according to all the comparisons that have actually been made.

In other words, regardless of how you implement Ord, there will never be three calls to Ord::cmp made by BTreeMap that result in A < B < C < A. It will simply never make that many comparisons, because the last of these would be redundant. After the implementation has already figured out that A < B < C, it never has a need to now compare C vs A, so it doesn't do that. And thus a consistent linear order is always maintained.

Perhaps this is how this should be specified?