Using std::collections and unsafe: anything can happen

My proposal could be rephrased as follows then:


P.S.: I just realized that if it isn't specified in which context panics may happen (i.e. if they could later happen throughout any other method/function in std, unrelated to the affected collection), this could affect soundness of some drop handlers (or code that runs after catch_unwind) :frowning:, as some code might rely on the fact that there is no panic happening in some part of the program, right?

1 Like

Let me try to fix this:

The behavior resulting from such a logic error is not specified but is limited to the following:

  • non-termination or degraded performance
  • memory or other resource leaks
  • program aborts
  • panics when invoking methods or functions that depend on the contents or change the contents of the collection,
  • inconsistent behavior regarding the contents of the collection, including but not limited to:
    • items randomly appearing or disappearing from the collection
    • iterators never ending and returning items indefinitely
    • method calls returning different results without changing the collection, such that x.len() != x.len()

The behavior will not be undefined behavior, but note that unsafe code must never rely on the consistency of the contents of the collection when proper implementation of the necessary traits isn't guaranteed, as it would be unsound.

I thought, all of the misbehavior is only for functions that depend on the contents of accesses the collection, and only happens during the execution of such functions. After a function returns, if it even returns, the collection might be modified in surprising ways (though that would be unobservable anyways due to the unspecified behavior of all ways to inspect the collection), but the behavior itself will have ended, and nothing more can happen until you call another function on a collection in a logic error state.

I. e. not only panics but memory leaks and non-termination or aborts also only happen when/while executing such methods.


What are "other resource leaks"?

I don't like "randomly appearing or disappearing", I think I might prefer something like "unpredictably" or "in an unpredictable manner" something like that, because "random" could be interpreted as a technical term implying random number generation, and the collection is certainly not going to be introducing a random number generator. Although... the internals of hash maps might even feel like a lot of randomness..

I think it's unfortunate to name mostly examples of things that are actually impossible in current implementations, like the inconsistent lengths or non-ending iterators. Not that we'd necessarily want to promise that this won't happen; but more focus on kinds of behavior that can happen currently would probably make this easier to understand.

As noted above, "items appearing or disappearing" is a notion that only really makes sense if you have a good Eq implementation.

These are very few and oddly specific things listed, I don't know what to think of this. Maybe "for example" would be better that "including but not limited to".

The inconsistent len() (without calls to &mut self methods in between) should be impossible if we specified side-effect-freeness, or we specified behavior "as if" there's no interior mutability in the collections (beyond what keys or values might include themself). To see this, note that len() doesn't have any trait bounds on K or V so it can't call any trait methods; so even interior mutability in the keys or values can't be relevant.

Clarifying that this is the case is the main point of this discussion :slightly_smiling_face:

As currently documented, no such bound is placed on misbehavior, and the only bounds to misbehavior is implicit QOI.

As of now, I would believe that is the case. But there was the argument to keep room for future (different) implementations. I attempted a conservative approach which only fixes the problem of unsafe code becoming unsound.

Regarding panics, we need to specify that these may not happen at any point in time (because of unwinding and code that relies on certain actions to be executed together or not at all), but it would be alright if a future implementation messes up some internal state that is shared beyond the particular collection, which then could later lead to program abort at an arbitrary time later. Only if we allow it to panic (with unwinding), we would run into the problem.

Hence I was conservative here and would give room to future implementations to cause an abort later (e.g. because some resources are exhausted, see also next paragraph).

Maybe a future implementation of one of the collections will do file-access to support very large datasets, or it will hold some System-V IPC resources (or quantum-computing resources?) for whichever reasons. I know that's very far fetched, but it leaves room for future changes of the implementation (that was my intention at least).

Don't get me wrong, I'm happy if the misbehavior is narrowed down further. I'm just not knowing these collections (and usual and possible future implementations) well enough to judge about it.

Hmmmm, interesting point. I often forget that things like these are possible:

use std::collections::HashSet;

struct NotHashable {
    _dummy: i32,
}

fn main() {
    let hash = HashSet::<NotHashable>::new();
    assert_eq!(hash.len(), 0);
}

(Playground)

In regard to my last proposal, I really saw it as an upper bound of what could happen.

Yeah, after reading your post, I agree these examples aren't good.

I think clearly specifying that the collections (uniquely) own their state and contents would resolve at least the worries of unrelated code affecting a collection. For example, any access to the Ts contained in a Vec<T> must go through it. If I have a usable &mut Vec<T>, then set.insert(Wobbly) can't possibly have even &-access to any T in that Vec, or any of it's other properties.

Notably that's a weaker promise than not having interior mutability. Just that of not sharing access in the way that Rc and Arc explicitly do.

2 Likes

Given the original example, I feel like fixing things in Vec is the wrong thing to do – it needs to be solved in HashSet. To name an example, we don't want to expect HashSet deleting files on disk due to a wrong trait implementation. The current documentation would allow that.

Besides, we would have to make statements on all sort of data types in std just to fix problems of how std::collections behave. The problem should be solved in std::collections, I believe.

You can have a good Eq and a bad Ord, with resurrect to Clone. That's the intuitive meaning I would say. That lets you say things like "inserting A can cause B to disappear, then removing C can cause it to reappear" when talking about how the lack of expectations is stronger than a user might expect.

Be careful to distinguish "does not have interior mutability" and "promises to behave as if it does not have interior mutability, regardless of violating logical conditions". Uniquely owning all state sits between those in terms of strength.

I believe being able to promise the latter does require that the code in question can be naturally factored out into a type that doesn't mention the generic parameters, but I'm not sure you can do anything with that.

1 Like

If I implement Hash so that it deletes random files on the disk, then HashSet will also delete random files on disk, and there is nothing you can do to stop it.

You could try saying "HashSet is onyl as broken or malicious as Hash is", but without a formal effect system that is a rather meaningless and unenforceable guarantee. Not that it matters that much anyway. You add an item to a HashSet, and your disk is wiped. Does it really matter to you whether it was a bug in Hash or an unspecified behaviour in HashSet? You just shouldn't run HashSet on untrusted types, if its correctness and side effects really matter to you.

This whole discussion looks like navel gazing. There is an infinite number of things some code could do, do you expect to specify them all? Do you enforce the same documentation requirements on every crate you use? On every module you write? How would that even look in practice?

Pretty much the only thing to take from HashSet's documentation is "here are the guarantees for well-behaved types, and for untrusted types you should never use HashSet, but at least if you mess up your program will fail in reproducible ways". There is no way to guarantee anything else when you're running untrusted code, and any other guarantees would be an undue burden on the evolution of stdlib containers.

1 Like

That is correct, but the problem is that HashSet is also allowed to delete random files on disk (or modify other Vecs) when Hash::hash does not cannot, e.g. because it's not implemented in std. This makes a non-unsafe implementation of Hash causing other unsafe code (edit: that relies on any behavior of std) to be unsound. And that's something we don't want to happen, I believe.

I think what I would like is that an impl Hash for some_module::SomeType isn't able to mess up how Vec behaves, for example. And that's a reasonable demand, I think.

So the problem is while Hash::hash may cause panics or delete files on disk, it should never be able to mess with std::Vec as long as it's not containing any unsafe code. But given the unspecified behavior of std::collections::HashSet, for example, it may mess up Vec. Hence the problem with unsoundness.

So the problem is that HashSet can do more harm than Hash::hash can do. In combination with unsafe, it can even lead to UB (which shouldn't happen if Hash::hash does not use unsafe at all).

Of course this is a theoretical problem, as we know the current implementation of HashSet doesn't. But it makes soundness checks formally impossible. It's really a documentation issue.

(But maybe I'm missing something here.)

I would agree if Rust didn't came with the concept of ruling out UB when all unsafe code is sound. I.e. in other programming languages[1] it wouldn't matter that much. But it kinda thwarts Rust's safety guarantees (or at least the ability to use them soundly, as defined by the reference).


P.S.: If we really want to allow HashSet to do "anything" due to a wrong Hash implementation, then Hash should be marked unsafe (which would cause a lot of problems). Currently, std fixes this by claiming that "anything but UB" can happen. But as shown in the discussion, there's no real practical difference between "anything" and "anything but UB" in this case.

P.P.S.: The problem would be less severe if HashSet wasn't part of std::collections but in a different crate, e.g. a third-party crate. That would still make it difficult to soundly write unsafe code that utilizes HashSet (or any other datatypes from that hypothetical crate), but it wouldn't affect unsafe code that utilizes std. And the latter is the problem here, I think.


  1. Compare the original thread on LMDB, where a wrong sort function has to be treated as UB. ↩︎

1 Like

As already mentioned, specifying the potential behaviours better might be required for writing unsafe code in a generic library that places some user-provided type in a collection. It would be correct to push the escalated requirement to your users as in:

// SAFETY: T **must** implement Ord
// consistently with all the requirements
// documented in ...
pub unsafe fn foo<T: Ord>(x: T) { }

However, that seems unfortunate if you just need the set to e.g. contain at least one element after an insert, though maybe that's already expecting too much: That the collections work properly as containers of the items they are handed, and just throw away any ordering- or identity-guarantees they may have had if the relevant trait implementations have some inconsistency.

We could be using different definitions, but I'd say a Mutex<T> also uniquely owns the T. Interior mutability or its lack is mainly significant when reasoning about something that is shared, so specifying there is no shared state makes it unimportant.

I agree that some clause on the scope of possible side-effects should be added for HashSet and others, but for that original example in particular, the assumptions you can make about an owned Vec across some arbitrary code that can't have a reference to that Vec is much more generally applicable.

If it helps, have you considered hashbrown, which std has been using for its HashMap for some time now?

Hmmm, maybe it's neither clearly HashSet's responsibility or clearly Vec's responsibility to document that a Vec won't arbitrarily lose elements because of something wrong that you do with HashSet, but a general responsibility of std to declare being "sane" in that matter.

I normally wouldn't even assume a module or crate does something bad like what I presumed in my OP, if it weren't for the explicit warning in HashSet and the reference. That is why I intuitively thought it should be fixed in HashSet's documentation.

And even if the documentation on Vec gets updated, it would still be contradicting to the warning in HashSet that anything but UB can happen. So the latter has to be fixed either way.

It's not really fixing the problem, as some crate I use (e.g. one with a module i_am_not_unsafe as in my OP) could use std::collection while another crate in the dependency tree (e.g. one with a module i_am_innocent as in my OP) could use unsafe. I can't fix this by not using std::collections myself. It would only be a solution if nobody (in my dependency tree) uses std::collections, or if nobody uses unsafe.

Thinking about all this further, I would like to say something about responsibilities:

  • It's the responsibility of the author of any unsafe code to make sure that this code is sound.
  • It's the responsibility of the author of any (safe or unsafe) code to correctly explain what the code does and what can go wrong is prerequisites aren't met.
  • It's generally fine for a module crate xxx to say: "If you do this, then anything weird could happen (we call this 'logic error'), but we're not using unsafe here, so at least it's no UB."
  • You should never have unsafe code rely on such module's crate's (xxx's) behavior.

The problem here is that module xxx is std in the given case, thus we cannot rely on std behaving in a certain way when we write unsafe code which uses std.

(But even if xxx is not std, then xxx would still be sort-of "dangerous" to use, and I would likely prefer a crate/module without such unspecified warning.)

So maybe a subtle change in the wording would be enough for the programmer to implicitly assume that std doesn't do unexpected things?

The behavior of a HashSet resulting from such a logic error is not specified (it could include panics, incorrect results, aborts, memory leaks, or non-termination) but will not be undefined behavior or arbitrarily affect unrelated data structures.

I know this is wording is kinda "sloppy", but maybe something like that (or a slightly improved version) would be sufficient, considering that not everything is always documented 100% precisely.

Also the reference could maybe reworded in a similar way.

I still think this kind of guarantee is completely meaningless, especially for std to make, since std isn't regular Rust code (e.g. it can use unstable features), so users don't know what limits safe std code has. Plus, it's not true that std doesn't use unsafe code. Saying "anything weird could happen" is essentially what UB means, and adding "except UB" doesn't help this, because it's a circular definition.

For instance: is HashSet, when provided with a bad Hash implementation, allowed to call private functions in my crate?

After all "calling other people's private functions" is not listed on the "behavior considered undefined" page.

The right solution for this is to define what can happen (and it should be a pretty short list of possibilities), rather than trying to rule out everything that cannot happen.

1 Like

No, UB has a specific meaning. Unbounded "anything can happen" makes writing sound unsafe code impossible, but so long as no unsafe follows "anything can happen," no UB occurs.[1]

UB means that you have violated the requirements of the Abstract Machine, and nothing can be asserted to be true about the execution or the state of the Abstract Machine.

You are correct that calling a private function is not in and of itself UB. However, calling a private function you do not have access to is a violation of encapsulation and cannot be done without breaking the rules of the Abstract Machine, thus manifesting UB.

std still has to follow the rules of the Abstract Machine. These rules aren't formally defined (there is no formal specification), but even with unstable features, the rules of the Abstract Machine are the rules that code has to follow. (It's just that rustc defines and implements the Abstract Machine.)


It would be ideal for users of std for std to specify exact bounds on misbehavior when maliciously crafted safe trait implementations are provided, yes. However, doing so is not the motivation here: the motivation here is doing the minimal change to guarantee that misbehavior does not violate the unenforced encapsulation between std types.


  1. And even though std contains unsafe, the documentation says that misbehavior cannot cause UB. ↩︎

2 Likes

I might reconsider after I see these formal specs of that Abstract Machine, but for now, I think of std APIs as part of that "Rust Abstract Machine". Therefore, a guarantee in std that says a function may do "anything that is allowed in the Abstract Machine" sounds circular and meaningless to me.

1 Like

Whether std is part of the language of is not part of the language makes no practical difference: It is bad enough if std can do anything regarding std because a lot of unsafe code in the world will (want to) use std data types in combination with unsafe.

I still believe a minimal fix would be sufficient to fix soundness outside std[1].


P.S.: Note that my OP does not only refer to std's documentation though, but also to the Rust reference, which also makes very unconstrained statements in describing the impact of wrong Hash or Eq implementations: "the program is considered erroneous and its behavior unpredictable." So it isn't really just a problem of std.


  1. std itself is already sound because it may rely on implementation details that aren't documented. ↩︎

Let me just add that this usage of "unspecified behavior" is very different from how C and C++ standards use the term. They never say "this leads to unspecified behavior" as a generic thing, they always say what exactly is unspecified -- the order of evaluation of function parameters may be unspecified, or the value of an i32 will be unspecified, etc. Not "the whole behavior will be unspecified".

Anyway, I think this whole problem can be solved by simply specifying something along the lines of (details to be worked out):

  • an operation XYZ on HashMap compares the lookup key X with key K in the map using Eq and Hash + Hasher
  • if for some K, X == K returns false, then K will not be found
  • if for some K, X == K returns true, and X's hash matches the last time K's hash was calculated, then one of these K's will be found
  • otherwise, if X == K returns true, but those hashes don't match, then K may or may not be found (that's the unspecified part)
  • (optional) if a key's hash is recomputed and it changes, a panic may occur (I don't know if this is necessary)

This is not fully formal, and slightly different choices could be made there, I just wrote this out to give the general flavor of how this would look, and that it doesn't seem that contentious to me.

I don't think it's really necessary to allow anything else to happen. I think std's implementation can be careful enough to never have memory leaks or infinite loops or other weird things. It probably doesn't do anything like that already. I can't even think of an implementation that would feel the need to do something like that.

If there is some fancy hash table data structure that does leak memory or go into infinite loops in those cases for some slight extra performance or whatever, I would suggest that it's a trade-off that shouldn't really by made by the standard collection, but rather by some specialized collection in a separate crate.

1 Like

Without knowing the C/C++ standards, I would assume this makes very much sense, because if you don't specify what can happen, it's effectively UB (even though you may argue it's formally not [1]).

Hmmmm, but may I ask what is the gain? Do I really need to rely on HashSet/HashMap terminating even if my Hash implementation is wrong? Both non-termination and wrong results (that might cause errors later in the program) are kinda on the same "level of evil" when it comes to debugging. Arbitrary changes of other people's Vecs is a whole different level though.

So my point is: Yeah, maybe the misbehavior could be narrowed down even further. But what's the net outcome when you look at the gain and the risks/constraints imposed?

A minimal fix would give the gain that unsafe code which relies on std can be verified in regard to soundness again (which was my original wish), and the costs/risks would be quite low (assuming it's possible to find a nice wording for a reasonable upper bound of what can happen in the worst case).

I'm far from being qualified to assess this weighing of pros and cons. People who are involved in the actual development of std or Rust will have more expertise to judge about that. (I'm merely a user of Rust.) But I currently don't see any reason to not start with a minimal fix and consider further guarantees later.


  1. But it may become UB in combination with unsafe code, as elaborated a couple of times in this thread. ↩︎

1 Like
  1. Simpler spec.
  2. Unambiguous spec. The idea that the collection can randomly change at any time is quite ambiguous to me (e.g. does that mean that a Vec that is inside a struct inside that collection can suddenly start changing too?).
  3. Much easier to (safely) deal with collections of unknown types than if you have to deal with the possibility that collection.len() != collection.len(), new objects randomly appearing, and other absurd things.

I just don't see why you'd want to have such a messy and complicated API when any reasonable implementation of a hash table I can think of does exactly what I said, regardless of how the hash function is implemented.