Peculiar behavior when doing bad things

I ran into a peculiar... behavior today that I still don't quite understand. I do understand that an "undefined behavior warning" means I'll likely get undefined behavior. What I don't understand is why (in this particular case).

In this example, I am initializing a dyn pointer as null, using transmute. Obviously this is a bad thing to do, and the compiler is within its rights to get upset, but this doesn't really explain what happens next, at least not to me.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7b3c8bde26708f8a49dd21ee8dd0d318

When I build this with --release, the compiler inserts an illegal instruction after the first println. Clearly, dereferencing this pointer would result in very bad things happening, but why is it that initializing it generates an illegal instruction in the produced binary? And if this is intentional, then why are we not generating a compile-time error rather than a runtime error?

Is there some aspect of initializing a dyn pointer that I'm overlooking?

1 Like

The compiler knows that dereferencing a null pointer is UB, so it infers that the code following it is unreachable and replaces it with the LLVM `unreachable` instruction. Rustc currently configures LLVM to lower `unreachable` to the ud2 x86 instruction, which raises SIGILL when executed.

2 Likes

No. This warning means executing the warned code is UB.

Because it's UB, and thus anything can happen.

The vtable part of a dyn pointer currently has a validity requirement (though not a stability guarantee) that it point to a valid vtable. Yours doesn't, so you have UB.

There's no dereference in the discussed code. The UB is constructing the pointer itself.

6 Likes

If the code never executes at all, you don't have UB. Also, this warning doesn't catch all transmutes that cause UB, only a hard-coded set of transmutes known to always cause UB when executed.

1 Like

But what is it that "executing" this transmute actually does? What about this transmute makes it UB? I had expected it to, at most, shift some bits around to produce two 64-bit pointers, which isn't undefined at all...

I can assign a *const T to a random value without UB, though I'll certainly have UB when dereferencing it. Why is it that the vtable part of a *const dyn T is different than the data part in this regard?

You can think of *const dyn T as a (*const ErasedType, &'static TVtable). Yes, that's a reference in there, even though you have a raw pointer. That reference is not allowed to be null, but your transmute makes it exactly that, so you get UB even though you never dereferenced any pointer.

As for why this isn't a compile-time error: it's probably because the check either has some false positives, or because the compiler team didn't want to break existing code that predates the warning.

8 Likes

Thanks, this is really helpful. It leads to a natural followup question though: why is it that we can't assign zero to a reference (in unsafe)? Is this simply to limit how far this type of bug can "leak" out of the unsafe block, or is there some more fundamental reason?

Also, since a compile-time error is apparently not in the cards, and since we are inserting code anyway (the illegal instruction), is there a good reason not to insert something more ergonomic, like a `panic!("Assigned zero to a reference!")`?

https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html

A reference being nonzero is part of its validity invariant. This means a zero is not a reference. This is what allows e.g. Option<&T> to be only one usize big: a zero isn't a reference, so can be used to represent Option::None.

We're not really inserting code: you're executing UB, so all bets are off. We've just configured LLVM to insert ud2 to eagerly crash the program in some cases of known executed UB (but this is not in any way reliable and is primarily to avoid execution running off the end of a function).

Two reasons, actually:

  1. A panic is potentially recoverable with catch_unwind. Diagnosed UB should be a hard crash.
  2. ud2 is a single instruction, and panicking is a lot of code. Just using ud2 keeps this exploit mitigation technique much lighter.
6 Likes

Because that's the rule, and we tell LLVM to take advantage of it by marking references as ok-to-dereference in the LLVM IR we produce (for example with https://llvm.org/docs/LangRef.html#parameter-attributes and https://llvm.org/docs/LangRef.html#dereferenceable-metadata). That allows LLVM to produce better code -- for example, it can hoist loads through a reference out of loop bodies without needing to check that the body will run at least once, because it doesn't need to worry about the possibility of the pointer being null.

We don't want people starting to rely on it and saying "well, they said it's UB but actually it panics so it's fine" -- no, it's not fine to do it.

2 Likes

Here's an example of the difference, running the same code with a pointer -- which is allowed to be null -- and a reference -- which is not allowed to be null: https://rust.godbolt.org/z/zxPMT1eeG

pub fn demo1(mut n: u32, x: &u32) -> u32 {
    let mut sum = 0;
    while n > 0 {
        n -= 1;
        sum += *x;
    }
    sum
}

pub unsafe fn demo2(mut n: u32, x: *const u32) -> u32 {
    let mut sum = 0;
    while n > 0 {
        n -= 1;
        sum += *x;
    }
    sum
}

With the reference, that compiles to just n * *x, because references are never null:

example::demo1:
        mov     eax, edi
        imul    eax, dword ptr [rsi]
        ret

But because pointers are allowed to be null, that one compiles to if n != 0 { n * *x } else { 0 } because it has to make sure it doesn't read through a null pointer:

example::demo2:
        test    edi, edi
        je      .LBB1_1
        mov     eax, edi
        imul    eax, dword ptr [rsi]
        ret
.LBB1_1:
        xor     eax, eax
        ret

That's why no, you're not allowed to make references be null even in unsafe code. If you want something to be nullable, use pointers or options-of-references.

Also, if you're wondering why the godbolt link uses an old version of rustc, it's because there's been a regression in LLVM. Looks like it's fixed in LLVM trunk, though, as I can repro the same thing in clang 13: https://cpp.godbolt.org/z/8bKb5WzTv

7 Likes

It usually cannot produce such a message, because the compiler usually cannot reason about the source of the UB. Having it panic with a general message is no better than just crashing, except that you can recover from a panic, which we want to disallow.

Interesting, yes of course, I didn't make the connection that you could catch the panic which of course isn't really viable here. I don't see why it can't reason about the source of the UB: it just did (giving both a warning, and deliberately generating crashing code).

Still, it's very peculiar that we get only a warning about doing something the rules say we absolutely must not do, in a situation where there's no reason to crash other than to enforce the above rule. And then once the crash happens, the reason provided for the crash is almost completely opaque (illegal instruction).

In this case it can, which is why it gave you a warning that you should not ignore. If you ignored it, well, that's your problem. But in most cases it can't.

Also, it would be very confusing if the compiler printed a message only when it didn't find optimization opportunities (if it did find them, it would just apply them, removing the code entirely).

Also, I'm not sure we can do this, since we depend on LLVM for that. But I'm not an expert in this area.

The warning plus the illegal instruction gives room for programs that contain this bad code but never reach it: those continue to compile and run fine. That disrupts fewer potentially working Rust programs than a hard error would.

Understood, though I'm still not fully clear on why we need to check the pointer. It would seem the alternative is to let it fly, and have undefined behavior when dereferencing the pointer.

Similarly, it seems one might allow null references, and have UB when following the reference. I suspect the underlying motivation has to do with keeping bugs contained, but I'd love to learn more.

Because if n is 0, then the pointer isn't dereferenced in the original code, so it's not UB to call demo2(0, ptr::null()). And the optimizer isn't allowed to introduce a null deref that wouldn't have happened in the original code, thus it needs to codegen the check.

This kind of thing happens all over the place. It would be a broad pessimization of codegen to allow references to be null, and there's basically no advantage to it -- people who actually want nullable references can just use options, which don't even have extra space cost when they hold references. (And that "no extra space cost" is because of the non-null validity invariant.)

7 Likes

Interesting, thanks a lot for taking the time to answer my questions!

Because this code would "hide" the UB with strange behavior:

let null_ref = make_null_ref(); // hypothetical: somehow produces a null &T (UB!)
let opt = Some(null_ref);
// We just smuggled in a null ref: it occupies the niche that `None` uses,
// so this `enum` effectively transmuted itself
assert!(opt.is_none());
1 Like

Also, if the optimizer sees something like

let opt = Some(foo);
assert!(opt.is_some());

it would like to be able to optimize it out entirely. (This is a useful optimization because after function inlining, there are often redundant tests which can be removed.)

It could not do so if Some(foo) were allowed to actually be None.

2 Likes