Terminology around unsafe, undefined behaviour, and invariants

I think completely getting rid of UB is a bad idea. For example, if dereferencing a null pointer/reference were defined and guaranteed to cause a panic, then 0 would have to be an allowed value for pointers and references. Not only would Rust then have to do a null check every time something is dereferenced, it would also mean that Option<&T> no longer fits in a pointer, since Some(0usize as &T) would have to be representable.
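
A minimal sketch of the layout guarantee at stake (illustrative only):

    use std::mem::size_of;

    // Because 0 is not a valid bit pattern for &T, Option<&T> can use it
    // to encode None, so the whole Option fits in a single pointer.
    fn main() {
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
        assert_eq!(size_of::<&u8>(), size_of::<usize>());
    }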

4 Likes

That's not the best example.

You could define *(0 as *const u8) to raise a segfault (SIGSEGV) without allowing null pointers in safe references.

I believe @zackw's point here is that for each operation defined as UB, there is some set of assumptions the compiler is thereby allowed to make, and some set of potential outcomes from running any possible optimization built on those assumptions, or from just translating the illegal operation into the machine code that performs it (such as a null dereference).

There are, as I understand it, two main issues with that approach:

  • it's not even close to true today, and trying to represent it as such is actively harmful to understanding UB. Compilers will often put soft boundaries on what UB can do (it will only impact running your executable, not compiling it; it won't actually summon nasal demons or eat your laundry; etc). As an extreme example, imagine

    // Pseudocode: the helper functions and `goto` are hypothetical.
    static mut BUF: [u8; 0x1_000_000] = [0; 0x1_000_000];
    fn main() {
        download_data_from_internet_into(&mut BUF); // attacker-controlled bytes
        __mark_executable(&mut BUF);                // make the buffer executable
        goto &mut BUF;                              // jump into downloaded code
    }
    

    There is no way to put bounds on what code injection does.

  • It's really, really hard to define UB in terms of "allowed optimizations". People have tried. And as soon as something is defined behavior, software will rely on it. (People write software that relies on a particular compiler's expression of UB being nonproblematic quite frequently, unfortunately.) Examples of how UB allows for optimization are all fairly simple -- assume it doesn't happen, don't generate code to handle the case where it does; see the sketch after this list -- but that's because these are examples that are meant to be somewhat approachable to people who don't write optimizers as their day job.
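
To make the "assume it doesn't happen" pattern concrete, here is a minimal sketch of the reasoning an optimizer applies:

    // Because dereferencing a null pointer is UB, after `*p` the compiler
    // is entitled to assume `p` was non-null.
    unsafe fn read_then_check(p: *const u8) -> (u8, bool) {
        // SAFETY: the caller promises `p` is valid; if `p` were null,
        // this dereference would be UB.
        let value = unsafe { *p };
        // An optimizer may fold this to `true` without emitting a check.
        (value, !p.is_null())
    }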

A third, smaller problem with replacing UB with defined-within-a-set-of-possibilities (if potentially disastrous) behavior is that it relies on "what the machine does". You could probably (with a lot of effort) document exactly what behaviors are possible on x86 as a result of UB. But if you compile the same code for a different architecture, maybe one that tracks uninitialized memory in registers, the set of possible behaviors is completely different.

UB is as infectious and wide-ranging as it is for a reason: scoping it any more tightly severely limits the ability of your compiler to turn your high-level code into something that runs on the dumb silicon in your machine. And for what? Behavior that's still definitely not intended and must never be executed anyway.

The solution to UB being dangerous is not to make UB less dangerous, it's to stop flirting with UB in the first place. Keep the UB holes for the optimizers to work with, but make things possible to do without opening up the portal to UB. And that's what Rust is great at!

8 Likes

What about (*(0 as *const [u8; x]))[x]? There will be some value of x for which the resulting address hits an allocated page.

I'm only directly responding to @CAD97 here but I think the quotes cover what everyone else said:

This is essentially correct, although I would put it more strongly: Whenever some operation is (currently) defined as UB, there exists an equivalent set of statements that a hypothetical language standard could make, in terms of assumptions allowed to the compiler and outcomes possible at runtime, and it would be clearer to write the language standard that way. I mean "equivalent" the way a mathematician means it: there would be no new constraints imposed on the implementation by writing the spec that way.

(We might discover, in the course of writing the spec that way, that some new constraints would be desirable, but that would be a separate change proposal.)

But it is, in fact, true today! Take the simple example from the parent thread. "The program has undefined behavior if the storage for a value of type bool ever holds a bit pattern other than 0000 0000 or 0000 0001" is 100% equivalent to the pair of statements "The compiler is allowed to assume that the storage for a value of type bool always holds one of the two bit patterns 0000 0000 and 0000 0001. The compiler is allowed to assume that any control-flow path that it can prove writes a bit pattern other than 0000 0000 or 0000 0001 to the storage for a value of type bool will never be taken." There simply aren't any other possibilities.
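
As a concrete illustration of that rule (a sketch, not part of the spec wording):

    use std::mem::transmute;

    // Only the bit patterns 0000_0000 and 0000_0001 are valid for bool.
    fn main() {
        let ok: bool = unsafe { transmute::<u8, bool>(1) }; // fine
        // let bad: bool = unsafe { transmute::<u8, bool>(3) }; // language UB
        assert!(ok);
    }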

I don't think this is what we call "UB" right now; nor are assembly inserts or FFI, both of which have the same level of opacity to the compiler. It's unsafe, because the onus is on the programmer to ensure that the machine code that's being downloaded and executed maintains all of the program's invariants. But the language spec can and should require the machine code to be executed. I would write something like "In any situation where the program invokes code with semantics opaque to the compiler (e.g. assembly inserts, FFI, just-in-time code generation), the compiler may assume that this code obeys all of the same requirements placed on correct unsafe Rust code."

Yes, but it's not hard to respecify UB in terms of allowed assumptions. People have also tried to do that, with notable success. John Regehr's "Friendly C" project is the most prominent example. (That project stalled out because it was also trying to reduce the set of allowed assumptions and couldn't reach consensus on that point. We don't have to do that part, at least not at first.)

The difference is that there are always optimizations we haven't thought of yet, but only so many assumptions that an implementation could make about some piece of data or code.

The set of behaviors that are possible on any abstract architecture that meets a few basic requirements we already impose (binary representation for integers, two's-complement negative numbers, certain overflow cases can be made to wrap around) is finite and knowable. I don't think it's even that hard to enumerate.

My goal here is not to make UB less dangerous, but to make it easier for programmers to know when their code is doing something it shouldn't.

Yes! On SCO Unix, dereferencing a null pointer did not necessarily cause a SIGSEGV: address 0 was a valid address to read (but not write). As a result, a program that read through a null pointer would have mysterious bugs that were incredibly difficult to track down.

If you're not changing the observable manifested behavior of UB at all, I fail to see how this meaningfully changes the status quo for UB detection.

If I read that correctly, your position boils down to flipping every "if operation X occurs, then the program exhibits UB" into "the compiler may assume that X does not occur". This does remove dependence on the term UB (and, on the main topic here: I'm weakly in favor of deprecating the term "Undefined Behavior" and just using UB as the industry term), but it just replaces it with an approximately-as-nebulous "the compiler may assume".

Ideally, a language is defined in terms of executing the language, not in terms of the process of compiling it. The latter is a step removed from what matters -- the resulting program -- even if it's slightly easier to understand for some group of developers.

What UB actually means is that whatever causes the UB can be assumed to not occur. Any instance of UB is theoretically equivalent to checking the precondition and doing unreachable_unchecked if the precondition fails (ignoring performance).
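
A minimal sketch of that equivalence (illustrative only):

    use std::hint::unreachable_unchecked;

    // "Dereferencing a null pointer is UB" behaves as if it desugared to
    // an unchecked precondition test:
    unsafe fn deref(p: *const u8) -> u8 {
        if p.is_null() {
            // The compiler may assume this branch is never taken.
            unsafe { unreachable_unchecked() }
        }
        unsafe { *p }
    }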

(Plus, I recall you saying you wanted a defined set of possible outcomes from e.g. dereferencing the null pointer? That's incompatible with "just" flipping "X is UB" to "assume not X". And compounding that is the fact that most UB exploits reduce to "UB makes code injection possible", which, again, throws out any possible bounding of the behavior that can result from doing some operation you told the compiler wasn't going to happen.)

4 Likes

I don't think this is very helpful at all to the understanding of UB. Firstly, it's longer, which is detrimental to its use as a definition. Secondly, its wording—mostly the phrase "it can prove"—is confrontational. It makes it seem as if, for every property, there were some defined set of operations you could perform to reliably ensure the compiler cannot prove that particular property. No such set exists, and exactly that misunderstanding is the cause of much careless introduction of UB. Worse, what the compiler can prove will change in newer compiler versions, and then the program will exhibit the usual UB symptoms. And that will be used as reasoning for staying on old compiler versions instead of admitting that the code is broken. This is not a guess; it's exactly what happens in some C programs.

2 Likes

This is technically true, but doesn't really constrain what the compiler is allowed to do. The invalid bit patterns in a bool can be used as a niche (e.g. in Option<bool>), so writing an invalid value to a bool may have unpredictable side effects, as sketched below.
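
A quick sketch of that niche (illustrative):

    // The invalid bit patterns of bool leave room for a niche: Option<bool>
    // encodes None in one of them, so it still occupies a single byte.
    fn main() {
        assert_eq!(std::mem::size_of::<bool>(), 1);
        assert_eq!(std::mem::size_of::<Option<bool>>(), 1);
    }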

That essentially brings us back to "UB means anything can happen", since "invok[ing] code with semantics opaque to the compiler" can happen in any number of ways. Consider overwriting a function pointer with a user-provided value: if the stack can be set up correctly (or a useful function already exists in memory), code execution can be obtained when that function pointer is invoked.

Presumably readers in this forum know that UB means that the language does not specify semantics for whatever is UB. However, I find it useful to consider UB as "Unspecified Behavior by the compiler". In other words, if your program has UB, you have no reasonable expectation of predicting what the compiler will do, and thus what the resulting code will be.

While this is entirely accurate, we should be careful with this specific wording since "unspecified behavior" is a term of art in C++ with a distinct meaning from "undefined behavior" (and we're definitely going to need a similar concept in our formal spec, though we'll likely call it something else). In fact, IIUC zackw is effectively arguing that we should make more behaviors C++-unspecified instead of C++-undefined, hence I felt the need to nitpick this immediately.

The fact that I felt the need to post this really underscores the point that existing terms suck for this. I'm all for experimenting with stuff like "safety violation".

3 Likes

I have watched for months/years the confusion in this forum, and even more in URLO, that the term UB engenders. That's why I offered the above interpretation of UB. I doubt that there is any experienced programmer who would be at all confused if they understood: When your source code contains UB you CANNOT rely on the output of the compiler—not now, not ever.

4 Likes

I think this thread is starting to go around in circles and also, speaking personally, it's setting off my desire to argue point by point by every last point, which is maybe not the most constructive use of either my time or anyone else's.

I also think we've gotten several change proposals tangled up together, some more controversial than others. I would like to table everything else and focus on just two things: (a) "undefined behavior" / "UB" is a bad term; (b) any single replacement term, whatever it may be ("safety violation", "validity invariant not honored", "ill-formed no diagnostic required", ...) will have the same problems.

I'm going to respond directly to just one comment which I think gets right at the heart of the disagreement between me and everyone else:

My experience, from ~8 years of working directly on GCC and another 12 years of attempting to explain to other C and C++ programmers why they couldn't trust the compiler to emit what they thought was "obviously" the correct translation of code whose behavior was undefined, is that this is wrong.

If someone hasn't already accepted the perspective taken by compiler devs, "When your source code contains undefined behavior you cannot rely on the behavior of the compiler, not now, not ever" is absurd. It is so obviously wrong that they must have misheard you. It will go in one ear and right back out the other and make no impression at all. Isn't the compiler a deterministic program, after all? It has to do something with this code and it sure looks like it does the right thing right now; why shouldn't they rely on that? If the code is erroneous, shouldn't they get a compile-time error? And this is true no matter what noun phrase you replace "undefined behavior" with.

"[This specific piece of code] does [this specific thing] which the compiler is allowed to assume you don't do, so a different compiler, or a future version of this one, might mis-optimize it", on the other hand, points at a concrete problem, and it explains why it's a problem in terms that make sense to people who don't speak standardese or compiler-internals-ish, and why it might break in the future. Similarly for "this construct runs fine in this environment but [on other CPUs / if the memory layout were a little different / if someone fed it more input than expected] it might trigger a hardware fault." You've got to be specific and concrete and give some plausible way in which the code really might not do the expected thing. Then it doesn't sound like you're talking about nonsense hypotheticals anymore, it sounds like you have actually seen it go wrong.

4 Likes

Which release of which compiler? rustc 1.43.0? 1.45.0? 1.46.0 nightly? A compiler with a cranelift backend? And for which architecture, and with what options?

Sure, any specific program that avoids using random numbers (intentionally or otherwise) is deterministic, but that does not make the behavior of the aggregated collection of all Rust compilers, in all versions, over all architectures, with all sets of compile-time options, usably deterministic.

The term UB means that the programmer cannot predict, over the lifetime of the source code and however it might be deployed, that the code will always work as intended. Because the source code DOES NOT meet the requirements of the Rust language, any compiler of that code is free to screw it up in a manner that is not predictable, over time, to even the most experienced compiler dev.

4 Likes

No, it isn't. That's the whole point of UB-driven optimisation: to allow the compiler to emit assembly code that isn't a 1:1 serialisation of the abstract syntax tree. In this case, to allow it to realise a pointer dereference in a manner different from emitting a load or store instruction, one that doesn't replicate those instructions' behaviour on invalid accesses.

The compiler promises to preserve only the semantics of the Rust abstract machine, not the implementation details of how this machine is realised on a particular target. And it's the former that you are supposed to program against when writing Rust.

5 Likes

Add to that the question of which compiler? rustc, mrustc, lccc (WIP), some other compiler that exists, will exist, or may exist? Which versions of which compiler pass plugins? Ah, you already covered that (I'm blind, apparently).

The sole exception to the predictability rule is if the compiler itself promises to execute the instance of undefined behaviour in a particular way (see compiler extensions), and either you target a particular vendor, or enough vendors agree on something that it becomes a de facto standard (for example, if enough Rust compilers agree on UCG, it doesn't necessarily need to be standardized).

Another thing – I sometimes see people arguing that violations of the API contracts of libraries (‘library UB’) should be distinguished from violations of invariants of the language itself (‘compiler UB’). While it's superficially appealing, I don't think it's going to help much.

In a sibling thread, there is a proposal to make File values containing a negative file descriptor illegal, to help with ABI layout optimisations. If adopted, this proposal is going to escalate what previously was ‘merely’ ‘library UB’ into ‘compiler UB’, without really changing the library contract – since what constitutes a violation of the contract doesn't change, only the consequences thereof.

If libraries are allowed to do that with impunity, then the distinction between ‘compiler’ and ‘library’ UB becomes moot. In the end, UB is UB; it doesn't really matter where it comes from.

1 Like

It can't be done with impunity, though. The standard library defines a convention that language UB can only come from misusing unsafe functions, and FromRawFd::from_raw_fd is unsafe.

The standard library defines lots of contracts. For example, the Hash API says that equal values must evaluate to equal hashes. Yet, since that API is "safe," violating that contract must not invoke language-level UB.

If there were no distinction between violating a language contract and violating a library contract, then there would be no difference between the unsafe FromRawFd and the safe Hash. The standard library cannot do its job unless you uphold both contracts, but only one of them is allowed to result in type-system-violating memory corruption.
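
A sketch of the contrast (illustrative): a contract-violating Hash impl can break HashMap lookups, but it must stay memory-safe.

    use std::hash::{Hash, Hasher};

    // A Hash impl that violates the contract: equal values can get unequal
    // hashes, because the hash depends on the value's address.
    #[derive(PartialEq, Eq)]
    struct Bad(u32);

    impl Hash for Bad {
        fn hash<H: Hasher>(&self, state: &mut H) {
            (self as *const Bad as usize).hash(state);
        }
    }

A HashMap<Bad, _> may then fail to find keys it contains (a logic error), while from_raw_fd offers no such bound, hence the unsafe.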

2 Likes

This is because the C++ standard has absurd UB and uses it as a cop-out when real-world compiler diagnostics are broken. We find genuinely questionable things such as:

If a non-empty source file does not end with a newline character after this step (whether it had no newline originally, or it ended with a backslash), the behavior is undefined (until C++11)

To my knowledge Rust does not have this; its UB is related to the semantics of program execution. Feel free to show me a counterexample if I just missed it. You're only allowed to use stable syntax in stable Rust, and everything else is diagnosed. The Language team is very careful that any new syntactic feature is backwards compatible and leaves room for future additions (within the same edition). (Okay, there's #[no_mangle], with which you can do questionable things akin to C++'s symbol-definition UB, but that's a bug; see the sketch below.)
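
A sketch of the kind of questionable thing meant (don't actually run this; it would break allocation):

    // #[no_mangle] exports exactly this symbol name, which can collide
    // with (or hijack) the C library's own `malloc`.
    // (Edition 2024 spells this #[unsafe(no_mangle)] for this very reason.)
    #[no_mangle]
    pub extern "C" fn malloc(_size: usize) -> *mut u8 {
        core::ptr::null_mut() // every C allocation would now "fail"
    }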

1 Like

I'd argue that annotating something #[no_mangle] is necessarily an unsafe construct, and its use is tantamount to an unsafe block or unsafe impl. If not, its entire existence is unsound (and it not existing severely limits how rust can be used in an embedded/freestanding context, as well as in a sufficiently low-level hosted context, such as when writing an implementation of a platform libc).

I agree with this observation; that's why wording like this should be part of teaching people what UB is/means. But this changes nothing about how the term is defined and used in the spec. A language spec should be concise and precise, e.g. by avoiding any mention of "compilation" and focusing entirely on "behavior when executing this Rust program". If the spec is not readable for beginners that's okay, it's not a tutorial after all.

I (still) very strongly disagree. As always when people propose to mix up language UB and library UB, suddenly we find ourselves talking about semver (it's about changing the library implementation, and we are arguing whether that change is breaking or not). Language UB has nothing to do with semver, while the contract imposed by a library on its clients of course has a lot to do with semver. It would be absurd for the language spec to talk about the File type, but it is crucial for the language spec to precisely nail down language UB.

I agree that from a user perspective, it doesn't matter if your code is wrong because of language UB or library UB. And maybe that is where all the disagreement about this distinction is coming from. But this is IRLO not URLO, and from a language designer perspective, it makes a huge difference. Please let us not conflate issues of "inexperienced users will get confused about this" with "what is the best way to precisely define Rust semantics and reason about it mathematically" and "what is the best way to specify a library API in a forward-compatible way".

It also makes a difference in terms of diagnostability; for example, Miri could one day reliably detect all language UB but obviously there is no way it can ever detect all library UB.
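
For instance (a sketch): Miri flags the out-of-bounds read below, but it has no general way to know that a Hash impl violates its documented contract.

    // Language UB that `cargo miri run` reliably detects:
    fn main() {
        let a = [1u8, 2, 3];
        let p = a.as_ptr();
        let oob = unsafe { *p.add(3) }; // read one past the end: language UB
        println!("{oob}");
    }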

EDIT: And of course it makes a difference for a library designer, who gets to specify the contract and thus the "library UB" of their own library (and gets to temporarily violate any associated invariants in library-internal code), but has to design this contract in a way such that language UB (and library UB of any dependent libraries) is avoided.

Also see this long Zulip thread.

12 Likes