Terminology around unsafe, undefined behaviour, and invariants

InfernoDeity · July 16, 2020, 10:42am

I would say this is false. While this is the only type of invariant that is compiler-known UB, UB is UB. If you violate an assumption of a library, I always consider that UB. I don't say, "well, there is deferred UB if you aren't careful". I say that it's UB. UB is UB, regardless of the level it happens at. Just because Rust doesn't consider something undefined behaviour doesn't mean that it isn't.

Ixrec · July 16, 2020, 12:27pm

That's literally a contradiction. We're talking about what the (future) formal specification of Rust should define UB as. Any notions of "library UB" will be built on top of that definition.

Nobody's claiming that violating your library's documented contracts is acceptable. That's certainly "UB" in the extremely vague "never ever do this or scary things will happen someday" sense, but we should not use "UB" that way because that's a precise term of art with a far more specific meaning. "Library UB" doesn't and can't directly affect what Rust the language is expected and guaranteed to do or not do when compiling your code. It only means the library can do arbitrarily scary things (up to and including invoking actual language-level UB).
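
To make the layering concrete, here is a minimal sketch. The `Slot` type, its methods, and `lookup` are hypothetical names invented for illustration, not taken from any real library. Calling `new_unchecked` with an out-of-range index breaks only the library's documented contract; language-level UB would arise later, and only because `lookup` relies on that contract inside an `unsafe` block.

```rust
/// Hypothetical library type. Library invariant: the wrapped index is < 4.
pub struct Slot(usize);

impl Slot {
    /// Checked constructor: upholds the invariant itself.
    pub fn new(index: usize) -> Option<Slot> {
        if index < 4 { Some(Slot(index)) } else { None }
    }

    /// Unchecked constructor.
    ///
    /// # Safety
    /// The caller must guarantee `index < 4`. Breaking this is a library
    /// contract violation; nothing goes wrong at the language level yet.
    pub unsafe fn new_unchecked(index: usize) -> Slot {
        Slot(index)
    }
}

/// Relies on the library invariant to skip the bounds check.
pub fn lookup(table: &[u8; 4], slot: &Slot) -> u8 {
    // SAFETY: every `Slot` is required to hold an index < 4.
    // If a caller broke that contract, *this* out-of-bounds read is where
    // language-level UB would actually occur.
    unsafe { *table.get_unchecked(slot.0) }
}
```

With a contract-respecting caller, `lookup(&table, &Slot::new(2).unwrap())` simply returns the third entry of `table`.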

TBH, almost all of this thread is just rehashing standard misunderstandings of the "UB" term, instead of responding to anything that was actually in the new blog post. I'm starting to wonder if we should make up our own jargon from scratch.

RalfJung · July 16, 2020, 12:42pm

This is not how UB is defined by the UCG or in the Rust reference. So please let us stick to common terminology here and not arbitrarily re-define things like "UB". We cannot possibly have technical discussions if everybody makes up their own terms.

Sometimes the UB I am talking about is also called "language UB" to distinguish it from whatever libraries are doing. What you mean could be called "library UB", and we should probably find a different name for it (not involving "UB" at all) to make things less confusing.

(To be fair, the Rust library docs call many things UB that are just library contract violations. It is not surprising that this is a common misconception.)

binomial0 · July 16, 2020, 12:44pm

[edit: removed a point that was already sufficiently made by Ixrec and RalfJung]

There is a very big difference between "UB" and "UB if you aren't careful". Writing unsafe code is all about being careful to avoid UB. There are ways to violate a library contract while being careful, so that UB is impossible regardless of the implementation of the library. The "set the string to zero byte-by-byte" example from above is one such way: no matter how the implementation of std::str would change, this can't trigger UB because no methods on the str are called while it is in an invalid state.

(I'd argue that the library contract automatically allows this, because it doesn't make sense for a contract to disallow something the library code can never observe. Still, the library certainly doesn't make this operation UB somehow.)
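
A minimal sketch of the byte-by-byte zeroing mentioned above, assuming the operation goes through `str::as_bytes_mut` (the function name `zero_in_place` is made up for illustration). Partway through the loop the buffer can hold invalid UTF-8, but no `str` method runs until every byte is zero, at which point the contents are valid again.

```rust
/// Overwrites every byte of `s` with 0 (hypothetical helper).
pub fn zero_in_place(s: &mut str) {
    // SAFETY: the slice is only written to, never used as a `str`, while the
    // borrow is live. When the borrow ends every byte is 0x00, and a run of
    // NUL bytes is valid UTF-8.
    let bytes = unsafe { s.as_bytes_mut() };
    for b in bytes.iter_mut() {
        // After zeroing the first byte of a multi-byte character, the buffer
        // is temporarily not valid UTF-8.
        *b = 0;
    }
}
```

Called on a `String` such as `"héllo"`, this leaves a string of the same byte length whose contents are all NUL.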

ckaran · July 16, 2020, 12:53pm

I agree with this, and would like to propose that "undefined behavior (UB)" be reserved for language contract violations, and "contract violation (CV)" be reserved for the other situations. The latter term is imprecise and may need to be broken down further in the future though.

InfernoDeity · July 16, 2020, 1:57pm

Undefined Behaviour has been around longer than Rust. Undefined Behaviour is not some magic language thing that exists to allow optimization of programs; taken literally, it is behaviour that is not defined to have any restrictions. An API can specify that a function has undefined behaviour in certain circumstances. In the C and C++ standards, it is defined to be:

Behaviour for which this International Standard imposes no requirements.

I would argue that it is Rust that is arbitrarily redefining and taking control of terms which have a more general meaning.

Undefined behaviour can be as simple as a library reserving a set of identifiers and saying these don't exist, but if you access them, it's UB, which serves to allow implementation details in public contexts. An excerpt from the "Basic Library Rules" of one of my (albeit C++) APIs:

Within the lclib namespace, or any namespace defined within it, names which start with a single underscore are reserved by the implementation. The behaviour of a program which names an identifier of such a form is undefined. Note - This serves to permit implementations to define additional names within the library which are intended to be used privately to implement required apis, without causing strictly-conforming programs to change behaviour - End Note

mjbshaw · July 16, 2020, 2:08pm

This is starting to get off topic. Yes, "undefined behavior" is a term used outside of the Rust community. But that's generally irrelevant because we're talking about undefined behavior in the context of Rust. And I don't think the usage of the term here is incompatible with its general usage.

The distinction that Ralf and others are trying to draw with the term "undefined behavior" is important because undefined behavior at the language level is the foundation upon which all other undefined behavior is built (i.e. library API contract violations resulting in UB).

If you want to discuss this further please fork this into a separate thread.

Moderator note: It was forked into a separate thread.

RalfJung · July 16, 2020, 2:19pm

The C standard is written in informal English. One unfortunate consequence of this is that the C/C++ standard uses "UB" both for "language UB" and for "library UB". But this distinction becomes crucial once you want to be precise; every single attempt to make C formally precise uses the term UB the way I am using it here (and that is why the Rust docs define it the way they do).

Also, the sentence you are quoting does not say why [language] UB exists -- so it in no way contradicts the interpretation that it primarily exists for optimizations. This applies at least to all sorts of [language] UB that were added "more recently", aka in the last 3 decades, to C/C++, such as strict aliasing or restrict. [Language] UB used to be mostly about differences between platform behavior, and some older parts of the C standard still show that heritage. Rust does not have such legacy so things are less muddied here.

(In contrast, library UB is mostly about being able to change implementation details of the library without affecting the behavior of [conforming] clients. This is yet another way in which language UB and library UB are fundamentally different.)

I already acknowledged that "library UB" is a thing, but it is not the thing we are talking about here. So please do not keep insisting that UB always means "library UB". Beyond that, @mjbshaw already said what I was about to say.

197g · July 16, 2020, 5:09pm

Contract violation does not quite sound strong enough. I would use this term as well for safe functions that are given bad inputs for which they panic or abort, for example core::slice::copy_from_slice. It is not permissible for safe interfaces to lead to UB, so this is quite distinct from the unsafe kind of contract violation that comes from violating safety invariants.
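
As a sketch of the safe kind of contract violation: the slice method `copy_from_slice` is documented to panic when the two slices differ in length. The helper names below are made up for illustration, and the sketch assumes the default unwinding panic strategy (under `panic = "abort"` the process would abort instead).

```rust
use std::panic;

/// Returns true if the mismatched copy panicked (hypothetical helper).
pub fn mismatched_copy_panics() -> bool {
    let result = panic::catch_unwind(|| {
        let mut dst = [0u8; 4];
        let src = [1u8, 2, 3];
        // Contract violation in safe code: lengths differ, so this panics.
        dst.copy_from_slice(&src);
    });
    result.is_err()
}

/// A caller that respects the contract by checking lengths first.
pub fn try_copy(dst: &mut [u8], src: &[u8]) -> bool {
    if dst.len() != src.len() {
        return false;
    }
    dst.copy_from_slice(src);
    true
}
```

The panic is a well-defined outcome: the program stops or unwinds, but no memory is touched out of bounds.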

Tom-Phinney · July 16, 2020, 6:05pm

Then why not Safety Violation (SV)? Contract Safety Violation (CSV) would be confused/conflated with comma-separated values, so that alternative doesn't work.

We can continue to bikeshed names, but I think that the underlying idea – of using an agreed-upon different term for this class of violations – is good. Perhaps Ada or some other contract-focused or safety-focused language has an appropriate term and acronym that we could borrow.

zackw · July 18, 2020, 3:09pm

I also think "undefined behavior" is a problematic term and we should try to find something clearer.

When I've written documentation that touches on these issues for C, I find it works best to talk in terms of assumptions made by different parts of the implementation. Here's a couple of examples.

bool occupies the same space as a u8, one byte, but the compiler generates code assuming that the numeric value of that byte is always either 0 or 1.

The following function [ed.: from this old thread] reads memory beyond the space allocated to the slice x. This is incorrect, even if x is a subslice of a larger allocation, because the compiler will assume that it does no such thing while generating code for its callers.
fn reach_beyond(x: &[i32]) -> i32 {
    // Deliberately incorrect: reads memory past the end of `x`.
    unsafe {
        *x.as_ptr().add(x.len() + 1)
    }
}
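
For contrast, a sketch of a bounds-checked counterpart (the name `reach_beyond_checked` is hypothetical). It yields `None` for the same index instead of reading outside the slice, even when the slice is a view into a larger array.

```rust
/// Checked counterpart: same index as `reach_beyond`, but never reads
/// outside `x`.
pub fn reach_beyond_checked(x: &[i32]) -> Option<i32> {
    // `x.len() + 1` is always out of bounds for `x`, so `get` returns None.
    let index = x.len().checked_add(1)?;
    x.get(index).copied()
}
```

Code that really wants the element from the larger allocation has to index through the parent slice, where the access is in bounds.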

A hypothetical Rust language standard would, in these terms, say that "the compiler may assume that (not X)" where the C standard says "the behavior is undefined if (X)".
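
The "compiler may assume" phrasing maps onto Rust fairly directly. In this sketch (function names are hypothetical) the checked version handles every byte value, while the unchecked version uses `std::hint::unreachable_unchecked` to state the assumption that the byte is 0 or 1; calling it with any other value would be UB.

```rust
use std::hint::unreachable_unchecked;

/// Checked conversion: every `u8` value has a defined result.
pub fn bool_from_byte(b: u8) -> Option<bool> {
    match b {
        0 => Some(false),
        1 => Some(true),
        _ => None,
    }
}

/// Unchecked conversion.
///
/// # Safety
/// The caller must guarantee that `b` is 0 or 1.
pub unsafe fn bool_from_byte_unchecked(b: u8) -> bool {
    match b {
        0 => false,
        1 => true,
        // SAFETY: the caller promised this arm is never reached, so the
        // compiler may assume "not (b > 1)" when optimizing.
        _ => unsafe { unreachable_unchecked() },
    }
}
```

The checked version costs a branch; the unchecked version trades that branch for an obligation on every caller.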

Aloso · July 18, 2020, 3:50pm

I think the term should include the word "contract", "invariant" or something similar.

IMO there are two kinds of contracts; you could call them "safe contracts" and "unsafe contracts". A safe contract is one that is never relied upon in unsafe code, so violating a safe contract can never cause UB. Unsafe contracts are relied upon in unsafe code; violating them can cause UB.

For example, it's an unsafe contract that a str or String must be valid UTF-8, because their implementations use unsafe code, which relies on the correct encoding.

Functions that don't validate their input and can cause unsafe contract violations (e.g. from_utf8_unchecked) must be unsafe.
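
A short sketch of that split using the standard library's own functions: `std::str::from_utf8` validates its input and is safe, while `std::str::from_utf8_unchecked` skips validation and is therefore `unsafe`. The wrapper names are made up for illustration.

```rust
use std::str;

/// Safe: validates, so the UTF-8 invariant of `str` cannot be broken.
pub fn text_checked(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Unsafe: passes the obligation on to the caller.
///
/// # Safety
/// `bytes` must be valid UTF-8.
pub unsafe fn text_unchecked(bytes: &[u8]) -> &str {
    // SAFETY: guaranteed by this function's own safety contract.
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

Passing invalid bytes to `text_checked` yields `None`; passing them to `text_unchecked` would break the unsafe contract that `str` methods rely on.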

@zackw UB is a well-established term that has been used in the Rust community for a long time. Maybe not everyone is familiar with the term, but they can be referred to the Wikipedia page (or the page in the Rust wiki).

Unsafe code and undefined behavior is a difficult topic, so I think it's important to use precise, unambiguous terms to avoid misunderstandings.

zackw · July 18, 2020, 7:22pm

I think that, despite "UB" being a well-established term that has been used for a long time, it is both imprecise and ambiguous, and we need to find a better term. Witness all of the threads arguing over exactly what it means!

And the specific point I jumped into this thread to make is that the replacement for "UB" needs to be a set of terms, because one of the several problems with "UB" is that it covers a whole bunch of different scenarios that are related but not that closely. For each case, do we mean

This operation will produce an unpredictable and arbitrary, but still valid, value of the result type
This operation may produce an invalid value of the result type
This operation may trigger a hardware fault that terminates at least the current thread (and may crash the entire computer, depending on the environment)
This operation may or may not have the expected side effects depending on unpredictable, timing-dependent runtime state
The compiler is allowed to assume that a black-box subroutine does not perform this operation
The compiler is allowed to assume that the arguments to this subroutine satisfy this constraint
The compiler is allowed to assume that, at this point in the code, all observable values of this type are valid
The compiler is allowed to assume that this operation will trigger a hardware fault that terminates at least the current thread (and therefore any code beyond this operation is unreachable)
The compiler is allowed to assume that any code path leading to this operation is unreachable and may be deleted
... et cetera?
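
As one illustration of how separate terms can work in practice, Rust's integer arithmetic already names several distinct overflow outcomes through different methods instead of a single blanket "undefined". This is an analogy chosen for illustration, not something proposed in the thread; the function name is hypothetical.

```rust
/// Returns the three defined views of `a + b` for `i32`.
pub fn overflow_outcomes(a: i32, b: i32) -> (Option<i32>, i32, (i32, bool)) {
    (
        a.checked_add(b),     // None on overflow
        a.wrapping_add(b),    // two's-complement wraparound
        a.overflowing_add(b), // wrapped value plus an overflow flag
    )
}
```

Each method documents exactly one outcome, so a reader never has to guess which scenario applies.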

Julius-Beides · July 18, 2020, 10:11pm

The term "undefined behaviour" focuses on the perspective of spec writers and compiler implementors, but isn't really meaningful for the actual language users.

What I mean is that as compiler writers it's natural to say for example: "We explicitly specify that dereferencing a null pointer is undefined behaviour, so we can do better optimisations, by ignoring edge cases around null pointers."

But from the perspective of a language user who doesn't know how compilers work, that raises more questions than answers. Furthermore, it's actually very much defined what dereferencing a null pointer (and other undefined behaviour) does: You get segfaults, silent data corruption or other nasty bugs. So the term is misleading.

I would even say that the term "undefined behaviour" doesn't sound dangerous enough for what it does. We need a word that makes people's alarm bells ring, even if they are just learning Rust. Something like "buggy" or maybe even "illegal"?

"Dereferencing a null pointer is an illegal operation" sounds more appropriate than "... is undefined behaviour", right?

Tom-Phinney · July 18, 2020, 10:28pm

zackw:

And the specific point I jumped into this thread to make is that the replacement for "UB" needs to be a set of terms, because one of the several problems with "UB" is that it covers a whole bunch of different scenarios that are related but not that closely. For each case, do we mean

This operation will produce an unpredictable and arbitrary, but still valid, value of the result type

This operation may produce an invalid value of the result type

This operation may trigger a hardware fault that terminates at least the current thread (and may crash the entire computer, depending on the environment)

This operation may or may not have the expected side effects depending on unpredictable, timing-dependent runtime state

The compiler is allowed to assume that a black-box subroutine does not perform this operation

The compiler is allowed to assume that the arguments to this subroutine satisfy this constraint

The compiler is allowed to assume that, at this point in the code, all observable values of this type are valid

The compiler is allowed to assume that this operation will trigger a hardware fault that terminates at least the current thread (and therefore any code beyond this operation is unreachable)

The compiler is allowed to assume that any code path leading to this operation is unreachable and may be deleted

... et cetera?

Any or all of the above. UB (Undefined Behavior) means that the compiler's optimizer and code generator—usually LLVM in most modern compilers—is no longer required to generate correct code for the submitted program. What the compiler will generate is unpredictable, and can change each time that program or any of its dependencies change, or each time the compiler is updated. The program's author cannot make any reliable prediction on what the consequences of UB will be, because even if they know what LLVM does in one release, they can't know how it will behave in the next "improved" release.

It is meaningful in the sense that I just described: if a program contains UB then its author—your "language user"—cannot assume that the generated code corresponds to what they thought they wrote.

InfernoDeity · July 18, 2020, 10:48pm

Undefined Behaviour holds a particular connotation. Using a different term really wouldn't have the same meaning.

Aloso · July 18, 2020, 11:53pm

No, I think that's wrong. UB means that you cannot predict what is going to happen, unless you're the one who wrote the compiler and know exactly how the code is compiled and optimized. For example, if the compiler can prove that the pointer you're dereferencing is null, it is free to assume that this part of your code is unreachable and may remove it.

"Undefined" in this case means "not defined in the language specification". So a conforming compiler can do anything it wants when encountering UB, and different conforming compilers can do different things. The behavior can also differ depending on the target platform, optimization level and compiler flags.
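
A sketch of how code can avoid relying on what a null dereference "does": check for null explicitly, here via the raw-pointer method `as_ref`. The function name is hypothetical, and the safety contract in the doc comment is an assumption of this example.

```rust
/// Reads through `ptr`, or returns `default` when `ptr` is null.
///
/// # Safety
/// `ptr` must be either null or valid for reads of a properly
/// initialized, aligned `i32`.
pub unsafe fn read_or(ptr: *const i32, default: i32) -> i32 {
    // SAFETY: `as_ref` returns None for null; non-null validity is the
    // caller's obligation per the contract above.
    match unsafe { ptr.as_ref() } {
        Some(value) => *value,
        None => default,
    }
}
```

Because the null case is handled in defined code, the compiler has no "unreachable" assumption to exploit on that path.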

zackw · July 21, 2020, 6:16pm

This reiterates what everyone already says about UB and doesn't move the discussion forward. Also, it's not even true. For any concrete case where the language standard doesn't define the behavior, there are concrete violations of assumptions permitted to the compiler, and/or there are concrete reasons why data access races exist, and/or there are concrete things that CPUs would do with a naive translation to machine language that are Not What You Want.

I'm starting from the premise that "undefined behavior means the compiler is not required to generate correct code for the program" has led to 20+ years of confusion, bugs, and finger-pointing about who is responsible for those bugs. My proposal for what to do about it is to scrap the term "undefined behavior" and instead say specifically what could happen in each case. Thus the list in my previous post.

Yes, writing down "specifically what could happen" might well involve putting bounds, which don't currently exist, on what the compiler is allowed to do with what we currently call UB, and consequently pushing changes down into LLVM that they might resist. But I see that as work that ought to happen anyway and worth spending persuasive effort on.

Tom-Phinney · July 21, 2020, 6:40pm

In summary, you would like to extend a language specification to constrain what a conforming compiler does when the language specification is violated, even though you acknowledge that in adding those constraints you are potentially reducing the extent to which a more-capable compiler could optimize valid programs.

My own opinion is that invalid programs should not forward-constrain any compiler from being able to produce more efficient code for valid programs. You may be willing to pay that price, but I am not, and I suspect that many other programmers would feel as I do that the suggested reward of better predictability of the behavior of language-specification-violating code is too costly to warrant that tradeoff.

There have been many efforts in the history of computing, such as capability-based architectures, to increase predictability at the expense of run-time performance. None of those efforts have fared well in the long term; the quest for improved performance has always determined what survives in the market.

InfernoDeity · July 21, 2020, 7:38pm

One of the many reasons so many things in the C and C++ Standards are undefined behaviour is that there is no exhaustive list of what can happen for every possible implementation. To specify one would at some level inhibit some particular implementation. I assume Rust also would not want to impose such a restriction.

For example, you can't actually define that dereferencing a null pointer causes a SIGSEGV, or otherwise aborts the program, because that would break freestanding environments, and on some platforms address 0 is actually valid.

Further, the fact that "Undefined Behaviour" is literal allows compilers to define extensions by giving meaning to certain cases of UB. In fact, arguably, a decent portion of the UCG is a report on rustc extensions to what could be called blanket UB. For example, transmuting Box<T> to OptionButNotReally<Box<T>> can be considered UB ("transmuting between [repr(Rust)] types is undefined behaviour"), even if it compiles, but it is valid in the case of rustc (given OptionButNotReally is defined the same as core::option::Option, without any lang items) because of the General Niche-Variant Optimizations which are documented by the UCG.
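
The niche layout being described can be observed with `size_of`. For the standard `Option<Box<T>>` the equal size is a documented guarantee; for a user-defined look-alike (the `OptionButNotReally` name is taken from the post, its definition here is an assumption) any matching size is something current rustc happens to do, so the sketch only reports it and does not rely on it.

```rust
use std::mem::size_of;

/// User-defined look-alike of `Option`, with no lang items involved.
#[allow(dead_code)]
pub enum OptionButNotReally<T> {
    None,
    Some(T),
}

/// (size of Box<i32>, size of Option<Box<i32>>, size of the look-alike)
pub fn niche_sizes() -> (usize, usize, usize) {
    (
        size_of::<Box<i32>>(),
        size_of::<Option<Box<i32>>>(),
        size_of::<OptionButNotReally<Box<i32>>>(),
    )
}
```

Only the first two numbers are something portable code may depend on; the third is an observation about one compiler.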
