Using std::collections and unsafe: anything can happen

But what if the cmp is just random? What does it mean to say that it's a consistent linear order? Certainly if I look at them -- say, by iterating -- then I might see them not in order, depending on what the randomness says when I check.

The guarantee is that the observed iteration order will stay the same over any period in which you don't insert or remove any elements. It's not randomized per iteration. This is because, in the implementation, iteration is a tree traversal that doesn't involve any comparisons; it only follows the 'physical' data structure itself.
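A minimal sketch of this claim (self-contained, so it uses a small deterministic generator in place of the `rand` crate; the `Chaotic` key type is made up for illustration): even when `cmp` answers effectively at random, two traversals of an unmodified map observe the same order, because iteration never calls `cmp`.

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

thread_local! {
    // Deterministic stand-in for randomness (a simple LCG),
    // so the example needs no external crates.
    static STATE: Cell<u64> = Cell::new(0x2545F4914F6CDD1D);
}

fn next_pseudo_random() -> u64 {
    STATE.with(|s| {
        let x = s
            .get()
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        s.set(x);
        x
    })
}

// Hypothetical key type: `cmp` ignores the payload and answers arbitrarily.
struct Chaotic(u32);

impl PartialEq for Chaotic {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Chaotic {}
impl PartialOrd for Chaotic {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Chaotic {
    fn cmp(&self, _other: &Self) -> Ordering {
        match next_pseudo_random() % 3 {
            0 => Ordering::Less,
            1 => Ordering::Equal,
            _ => Ordering::Greater,
        }
    }
}

fn observed_orders() -> (Vec<u32>, Vec<u32>) {
    let mut map = BTreeMap::new();
    for id in 0..10 {
        // Some inserts may hit `Equal` and merely replace a value; whatever
        // ends up stored sits in the one order consistent with the answers
        // `cmp` gave during insertion.
        map.insert(Chaotic(id), ());
    }
    // Iteration is a pure tree traversal and performs no comparisons,
    // so repeated traversals of the unmodified map must agree.
    let first = map.keys().map(|k| k.0).collect();
    let second = map.keys().map(|k| k.0).collect();
    (first, second)
}

fn main() {
    let (first, second) = observed_orders();
    assert_eq!(first, second);
    println!("stable order: {first:?}");
}
```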

Even if cmp is totally random, the order inside BTreeMap will be perfectly consistent with everything that cmp has told BTreeMap.

The BTreeMap insertion, deletion, and lookup algorithms always make sure there is precisely one such order.

When you later ask cmp, you might get a different result, but that's not what cmp had told BTreeMap. Therefore, the guarantee should be: the order will always be consistent (and unique) with respect to everything that cmp returned when BTreeMap called cmp (not with what cmp may have returned when somebody else called it).

For instance, say you have a random cmp.

You insert A: no comparisons are made. Then you insert B. cmp(B,A) is called and it randomly returns Less. Then you insert C. cmp(C,A) is called and it randomly returns Greater.

At this point the order inside BTreeMap is B, A, C and is the only order that is perfectly consistent with everything cmp said in response to queries made by BTreeMap.

What if you insert D greater than C and less than B? Your order would have a cycle. I would think the BTreeMap could put the elements in any order, especially if a tree rebalancing causes cmp to be called on previously inserted items again.

With a random cmp it is meaningless to say "D is greater than C and less than B", you have to give me a specific sequence of cmp inputs and outputs.

After BTreeMap calls cmp(D, C) which returns Greater, it will already know the order: B, A, C, D. It's not going to call cmp(D, B) next, because it already "knows" that D > B, so D does not become "less than B".

Tree rebalancing never calls cmp. The linear order is precisely known already before rebalancing, and is maintained during rebalancing.

I like that.

Of course, there could still be UB if there is unsound code which makes wrong assumptions about the misbehavior. Not sure if it's relevant to mention that, e.g. by adding something like:

[…], but does not include any behavior considered undefined (as long as all unsafe code involved is sound).

But perhaps that doesn't need to be mentioned, as unsound code can always turn misbehavior into UB.

To summarize what I said above, my proposal for BTreeMap would be to replace the whole section about "logic errors" with something along the lines of:

Normally, Ord should be implemented in such a way that it is deterministic and creates a total order among all possible keys. However, BTreeMap works properly even when that is not the case. BTreeMap will always maintain the unique total order among the keys that is consistent with the results of all the Ord::cmp comparisons that it performs on the keys, and will not perform any additional comparisons that might contradict that order.

In other words, there is no need for any "panics, incorrect results, aborts, memory leaks, or non-termination". The only unspecified thing is exactly which comparisons get performed, which in the case of a non-total-order Ord implementation may affect the ordering, but doesn't involve any arbitrary misbehavior.

I think something similar can be specified for HashMap, but it's more complicated, because it involves three APIs interacting rather than just one (Hash, Hasher and Eq), and because hashes get recomputed multiple times for the same key in the current implementation, so unlike for BTreeMap, inconsistent information can be fed into a HashMap. But I think it's doable to specify how such inconsistent information may be treated without resorting to broad and vague language like "the behavior is not specified except the following things will definitely not happen".

I would not say BTreeMap works properly. If the ordering changes, then a lot of calls will not behave as expected. BTreeMap::range is currently specified to panic if start > end, for example; and determining whether start > end is done through the implementation of Ord.


Admittedly "works properly" may need a slight elaboration. What I mean by this is: BTreeMap figures out one linear order using Ord queries and all operations work assuming that order.

I don't see any issue with your range example. It can still ask Ord whether start > end and still panic if Ord says yes.
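A sketch of that range behavior, assuming an inconsistent ordering that is at least deterministic: the hypothetical key type `AlwaysGreater` below answers `Greater` to every `cmp` query, so when `range` asks Ord whether start > end, Ord says yes and the documented panic fires.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::panic::{self, AssertUnwindSafe};

// Hypothetical key type whose `cmp` always answers `Greater`.
struct AlwaysGreater;

impl PartialEq for AlwaysGreater {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for AlwaysGreater {}
impl PartialOrd for AlwaysGreater {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for AlwaysGreater {
    fn cmp(&self, _other: &Self) -> Ordering {
        Ordering::Greater
    }
}

fn range_panics() -> bool {
    let mut map = BTreeMap::new();
    map.insert(AlwaysGreater, ()); // the very first insert performs no comparisons
    // `range` asks `Ord` whether start > end; `cmp` answers `Greater`,
    // so the documented "range start is greater than range end" panic fires.
    panic::catch_unwind(AssertUnwindSafe(|| {
        let _ = map.range(AlwaysGreater..AlwaysGreater).count();
    }))
    .is_err()
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // silence the expected panic message
    println!("range panicked: {}", range_panics());
}
```

So the panic guarantee is itself phrased in terms of what Ord answered, which is consistent with the proposed specification.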

If I understand it right (though I really don't have a good overview of it), BTreeMap will have a consistent order when doing a full traversal. But how can you search for an element in the tree if Ord::cmp just returns random results on each call?

You use Ord::cmp queries to implicitly place your lookup key somewhere in the sorted list of all the tree elements. If at some point Ord says Equal, you have found your key. If not, you have placed the lookup key between two consecutive tree elements and this proves that it is not an element according to what Ord has told you.

If Ord is random, then your lookup key will end up being placed at some random position in the sorted order -- the one and only position consistent with all the answers Ord has given you.

Key identity is determined by Ord, not by Eq. You may think that you have a key that is identical to what is inside the tree, but if Ord is random and says otherwise, then it's not the same key as far as BTreeMap is concerned. I still define it as correct behavior. What Ord says is the source of truth about keys for BTreeMap.

This by the way doesn't require any code changes (I believe). All this is already what BTreeMap does today.
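A small self-contained sketch of that lookup behavior (using a deterministic rotation of answers in place of real randomness; the `Flaky` key type is made up for illustration): each `get` may or may not "find" the stored key, depending on what `cmp` answers, but the map itself is never mutated by lookups.

```rust
use std::cell::Cell;
use std::cmp::Ordering;
use std::collections::BTreeMap;

thread_local! {
    // Deterministic stand-in for randomness: rotate through the three answers.
    static TURN: Cell<u8> = Cell::new(0);
}

// Hypothetical key type whose `cmp` answers Less, Equal, Greater in rotation.
struct Flaky;

impl PartialEq for Flaky {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Flaky {}
impl PartialOrd for Flaky {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Flaky {
    fn cmp(&self, _other: &Self) -> Ordering {
        TURN.with(|t| {
            let answer = match t.get() {
                0 => Ordering::Less,
                1 => Ordering::Equal,
                _ => Ordering::Greater,
            };
            t.set((t.get() + 1) % 3);
            answer
        })
    }
}

fn lookups() -> (usize, usize) {
    let mut map = BTreeMap::new();
    map.insert(Flaky, ()); // the first insert performs no comparisons
    // Each `get` places the lookup key somewhere in the stored order:
    // `Equal` finds the stored key; `Less`/`Greater` places the lookup key
    // next to it, proving (per Ord's answers) that it is "not present".
    let found = (0..9).filter(|_| map.get(&Flaky).is_some()).count();
    (found, map.len())
}

fn main() {
    let (found, len) = lookups();
    println!("found the key in {found} of 9 lookups; map still holds {len} entry");
}
```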

Let me share an example showing where I see "inconsistency":

use rand::{random, Rng, thread_rng};
use std::collections::BTreeMap;
use std::cmp::Ordering;

struct Wobbly;

impl PartialEq for Wobbly {
    fn eq(&self, _: &Wobbly) -> bool {
        random()
    }
}

impl Eq for Wobbly {}

impl PartialOrd for Wobbly {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Wobbly {
    fn cmp(&self, _other: &Self) -> Ordering {
        match thread_rng().gen_range(0..3) {
            0 => Ordering::Less,
            1 => Ordering::Equal,
            2 => Ordering::Greater,
            _ => unreachable!(),
        }
    }
}

const HELLO: &'static str = "Hey, Hello World!";

fn main() {
    let mut tree = BTreeMap::<Wobbly, ()>::new();
    tree.insert(Wobbly, ());
    // We can't assume that `tree` behaves consistently!
    let entry = tree.get(&Wobbly);
    for _ in 0..30 {
        let x = tree.get(&Wobbly);
        unsafe {
            let offset = if entry == x {
                5
            } else {
                // Should not happen, but does happen!
                1024
            };
            let hello = std::ptr::slice_from_raw_parts(
                (HELLO as *const _ as *const u8 as usize + offset) as *const u8,
                12
            );
            let hello_str = std::str::from_utf8_unchecked(&*hello);
            println!("{hello_str}");
        }
    }
}

(Playground)

Output:

…
elf.results.
elf.results.
elf.results.
elf.results.
Hello World!
Hello World!
elf.results.
elf.results.
Hello World!
…

Besides, if cmp returns random values, this could lead to the tree becoming imbalanced, which would also cause unexpected behavior (e.g. in practice it could lead to de facto non-termination due to slowdown, depending on how the tree is used).

If we allow non-termination as possible misbehavior, then this also covers a downgrade of performance on these "logic errors".

This is way stronger than I would ever be willing to say. At the absolute minimum, it needs an exception for debug mode to be allowed to do the occasional debug_assert!(chunk.first() <= chunk.last());, or other similar non-big-O-impacting checks. (We can't actually take advantage of that today, but hopefully once we get progress on -Z build-std we'll be able to let end users benefit from such checks too.)

And it's unclear to me how much one gets from a "it's not allowed to panic" restriction -- if you don't know whether the Ord is logically correct, you don't know whether it's panic-free either.

"Works properly" feels like the wrong description here, since most of its methods can no longer meet their postconditions.

And what does one do with a BTreeMap where you know things are in a consistent-for-the-calls-during-insertion order? That means linear iteration happens in a repeatable order for the specific container, but if I have a BTreeMap it's because I wanted something other than some-order-iteration.

While I agree this makes sense for element-by-element insertion, I think it could actually be restrictive overall.

For example, BTreeMap/BTreeSet::from_iter: use bulk building to improve the performance by xu-cheng · Pull Request #88448 · rust-lang/rust · GitHub uses sorting in BTreeMap::from_iter, and sorting wants to fall back to O(n²) algorithms for small chunks, and those algorithms often don't have the "no extraneous comparisons" rule (as can be seen by a pigeonhole argument).


I think that instead of doing anything container-specific we should start with making the generally-implied Rust expectations explicit.

That might be things like

  • The number of items in a std container doesn't change unless you call a &mut method on that container.
  • The std containers don't themselves use interior mutability or locking (though implementations of traits on their generic parameters, including the allocator, certainly might)

Not sure if I understand it right, but in terms of my original example, you'd give guarantees on Vec's behavior rather than on HashSet's misbehavior?

Maybe I misunderstood.

I don't think this is sufficient, because even if HashSet can't randomly drop items from a Vec, for example, it could still delete files on disk (which arguably might not be UB).

I think "as-if they didn't have interior mutability" is extremely strong and general. Is that enough? Too much?

Well, it really should cover "won't delete files" too... hmm. Perhaps add something like "as-if it does not access any other items except those documented to be accessed during conforming use"?


I see. In that case, you could loosen my definition somewhat to say: it may perform some extra comparisons, and either panic if they are inconsistent or ignore the results, but still stick to the linear order that follows from a consistent subset of all those comparisons if it doesn't panic.

The goal isn't really to use such B-trees deliberately.

The goal is to have a specification of what this data-structure guarantees, because traits like Ord are not unsafe, and so you can't just assume it has some properties.

BTW, this is why the Ord docs shouldn't be saying "Implementations must", because that sounds like "it's undefined behavior if implementations don't". The docs should at most say "implementations are recommended to" or "implementations should".

Unsafe code has to have some guarantees better than the current "it may do anything at all, except these 5 things that are listed on the memory safety list, and maybe a couple other things that we forgot to list there", because that's so open-ended that it's useless. For instance, the current sloppy situation has led to it being formally almost impossible to write any unsafe code whatsoever, as the OP described.

Some people have proposed solutions, but I still see them as quite sloppy and open-ended, so I tried to propose a more formal, mathematical spec.

I think this is false. Insertion sort doesn't do any superfluous comparisons.

I thought about this some more, and I think you're right. It never does a comparison where it has enough information to know what the answer is already.

I guess the extra comparisons in the asymptotic behaviour can generate enough excess information to know that the triangle rule doesn't hold over the range, but they're not strictly superfluous.
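For what it's worth, the weaker, observable part of that claim is easy to check mechanically: in a plain insertion sort, every comparison involves the element currently being inserted, so no pair of elements is ever compared twice. A sketch (the ids and input values are made up for illustration; this demonstrates "no repeated comparisons", not the stronger "never asks a question whose answer is already implied"):

```rust
use std::collections::HashSet;

// Plain insertion sort over (id, key) pairs that records every compared
// pair of ids. The ids exist only so we can track which elements met.
fn insertion_sort_logged(items: &mut [(usize, i32)]) -> Vec<(usize, usize)> {
    let mut compared = Vec::new();
    for i in 1..items.len() {
        let mut j = i;
        while j > 0 {
            // Every comparison involves the element being inserted (at j).
            compared.push((items[j - 1].0, items[j].0));
            if items[j - 1].1 <= items[j].1 {
                break;
            }
            items.swap(j - 1, j);
            j -= 1;
        }
    }
    compared
}

fn main() {
    let mut v: Vec<(usize, i32)> =
        [5, 3, 8, 1, 9, 2, 7].iter().copied().enumerate().collect();
    let log = insertion_sort_logged(&mut v);
    // No unordered pair of ids appears twice in the log.
    let unique: HashSet<(usize, usize)> =
        log.iter().map(|&(a, b)| (a.min(b), a.max(b))).collect();
    assert_eq!(unique.len(), log.len());
    // And the result is, of course, sorted by key.
    assert!(v.windows(2).all(|w| w[0].1 <= w[1].1));
    println!("{} comparisons, all distinct pairs", log.len());
}
```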

  • Panic is needed for debug, at least.
  • Non-termination would cover degraded performance.
  • Memory-leak would cover items in the collections that cannot be reached anymore.

If we furthermore allow the collection to act "inconsistently" with regard to its contents (not sure whether it should be able to "invent" elements that were never added, though), wouldn't this cover all possible things that can go wrong?

(Not sure about aborting either.)

Your code really has nothing to do with BTreeMap, it's that you implemented == to return random values, and then in your unsafe code you assumed that == must return true? Because offset will be 5 if == returns true, 1024 if == returns false.

No, it can't. Regardless of where in a B-tree you insert things, it will remain balanced.

I would not agree here. The tree doesn't behave "consistently". We can fix == and the problem persists:

-use rand::{random, Rng, thread_rng};
+use rand::{Rng, thread_rng};
 use std::collections::BTreeMap;
 use std::cmp::Ordering;
 
 struct Wobbly;
 
 impl PartialEq for Wobbly {
     fn eq(&self, _: &Wobbly) -> bool {
-        random()
+        true
     }
 }

(Playground)

Oh okay, you might be right, I didn't have a good overview of this.