Role of UB / uninitialized memory

Yeah, I was also pondering what the consequences of "uninit memory is just indeterminate, not UB to use" would be. Padding bytes are a problem for implementing memcpy, but "uninitialized" may not be the best description for padding, and I can't think of any other use case—you generally should not read truly uninitialized memory, no matter what exactly we count under that.

While you argued that we might be able to prevent LLVM from treating any heap allocated memory as uninitialized, getting "arbitrary bytes, not UB" semantics for mem::uninitialized would require either initializing (defeating the point of the intrinsic) or inserting freezes (once that exists in LLVM, it currently doesn't) on every load, either of which would cost some performance.

Finally, before this discussion even started we did see that memcpy could be implemented even if padding was uninitialized and reading uninitialized memory was UB (by using a union to represent the potential-uninitialized-ness). So I'm left without any strong reason, even if I couldn't argue that leaving it as UB had performance advantages. (I'm not actually sure I can, but apparently that's moot now?)
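
To make the union idea concrete, here is a minimal sketch in Rust. The names MaybeByte, copy_bytes, Padded and demo_copy are made up for illustration, and the sketch assumes that a union may hold arbitrary (including uninitialized) bytes, which is the same assumption std's MaybeUninit relies on. It is not a claim about how any real memcpy is written.

```rust
use std::mem::{size_of, MaybeUninit};

/// One byte of memory that may or may not be initialized.
/// Hypothetical type for this sketch; std's `MaybeUninit<u8>` plays the same role.
#[derive(Clone, Copy)]
#[repr(C)]
union MaybeByte {
    uninit: (),
    byte: u8,
}

/// Byte-wise copy that moves every byte as a `MaybeByte`, never as a `u8`,
/// so it never claims that a byte (padding or otherwise) is initialized.
///
/// Safety: `src` must be readable and `dst` writable for `len` bytes,
/// and the two ranges must not overlap.
unsafe fn copy_bytes(dst: *mut MaybeByte, src: *const MaybeByte, len: usize) {
    for i in 0..len {
        unsafe { dst.add(i).write(src.add(i).read()) };
    }
}

#[derive(Clone, Copy)]
#[repr(C)]
struct Padded {
    a: u8, // on common targets, followed by 3 padding bytes
    b: u32,
}

/// Copies a `Padded` (padding included) through `copy_bytes`
/// and returns the fields of the copy.
fn demo_copy() -> (u8, u32) {
    let src = Padded { a: 7, b: 0xDEAD_BEEF };
    let mut dst = MaybeUninit::<Padded>::uninit();
    unsafe {
        copy_bytes(
            dst.as_mut_ptr().cast::<MaybeByte>(),
            (&src as *const Padded).cast::<MaybeByte>(),
            size_of::<Padded>(),
        );
        // Every field of `dst` was copied from an initialized field of `src`;
        // the padding may still be uninitialized, which is fine for padding.
        let dst = dst.assume_init();
        (dst.a, dst.b)
    }
}
```

Calling demo_copy() copies the whole object representation, padding and all, without ever producing a u8 from a possibly uninitialized byte.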

There are other reasons why uninitialized padding bytes in particular are problematic.

In low level code, it's a fairly common practice to cast a pointer from structure type to byte array (e.g. for caching data to disk), or byte array to a structure type (retrieving cached data, interpreting IO memory in drivers, etc). For types that satisfy certain requirements, this is perfectly safe, and exposing this capability in a safe way can cut down on bugs from incorrect conversions. I try to do that with plain::Plain.

Padding bytes complicate the matter, because they add a mostly invisible way to accidentally write broken code.
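
A small sketch of how invisible the padding is, and of one way to sidestep it. Note that this uses field-by-field serialization, which is a different technique from the pointer cast described above; Record, padding_len, to_bytes and from_bytes are hypothetical names, and the 3-byte figure assumes a target where u32 is 4-byte aligned.

```rust
use std::mem::size_of;

#[repr(C)]
struct Record {
    tag: u8,
    value: u32,
}

/// Number of bytes in `Record` that belong to no field.
/// With 4-byte-aligned `u32` this is 3: the struct occupies 8 bytes
/// but its fields only account for 5.
fn padding_len() -> usize {
    size_of::<Record>() - (size_of::<u8>() + size_of::<u32>())
}

/// Serializes field by field, so padding never reaches the output.
fn to_bytes(r: &Record) -> [u8; 5] {
    let mut out = [0u8; 5];
    out[0] = r.tag;
    out[1..].copy_from_slice(&r.value.to_le_bytes());
    out
}

/// Inverse of `to_bytes`.
fn from_bytes(bytes: &[u8; 5]) -> Record {
    let mut v = [0u8; 4];
    v.copy_from_slice(&bytes[1..]);
    Record { tag: bytes[0], value: u32::from_le_bytes(v) }
}
```

Casting &Record to a byte slice of length size_of::<Record>() would expose those padding bytes; the field-wise version trades a little code for never touching them.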


That's a footnote on the specification of memcmp. Footnotes are non-normative. The more important text for the purpose of this discussion is § 6.2.6.1 "Representations of types -- General", paragraph 6:

When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values.42 The values of padding bytes shall not affect whether the value of such an object is a trap representation. Those bits of a structure or union object that are in the same byte as a bit-field member, but are not part of that member, shall similarly not affect whether the value of such an object is a trap representation.

[footnote 42] Thus, for example, structure assignment may be implemented element-at-a-time or via memcpy.

My interpretation of this text is: when a structure object is freshly allocated (by any means) all of its contents are indeterminate (reading any field or padding byte, by any means, provokes UB), but once you initialize even a single member, all of the padding bytes cease to be indeterminate and become merely unspecified. So I stand on what I originally said: in C, as long as all of a structure's value fields have been initialized, you can make whatever read accesses you want to its object representation and at worst you risk unspecified behavior.

The intent is

struct S { int a; double b; };
struct S x, y;
memcpy(&y, &x, sizeof(struct S));  // undefined behavior

but

struct S { int a; double b; };
struct S x, y;
x.a = 0;
x.b = 0;
memcpy(&y, &x, sizeof(struct S));  // OK even if there is padding between a and b

and also (this reads the padding bytes out of struct S; note that I did not initialize x.b here):

struct S { int a; double b; };
struct S x;
unsigned char y[offsetof(struct S, b) - sizeof(int)];
x.a = 0;
memcpy(&y, ((char *)&x) + sizeof(int), sizeof y); // OK
// sizeof(y) is implementation-defined
// the contents of y are unspecified
// reading x.b would still have undefined behavior

Note that this is different from a scalar type with trap representations, e.g. _Bool and signaling NaN.

From my reading of the standard, "indeterminate" does not imply "reading any field or padding byte, by any means, provokes UB". I cannot see anything in the standard saying this is the case. Most operations on indeterminate values are UB because they could be a trap representation, so e.g. performing integer addition is UB, but load and store do not care.

The expectation of C programmers (and, as far as I can tell, also compiler writers) is that memcpy is not UB as long as the pointers are in-bounds and disjoint. If we do our own loads/stores rather than calling memcpy, we have to take the "effective types" restriction (§6.5, paragraph 7; this is what enables TBAA) into account, so we have to use a pointer of character type.

I disagree; initializing a member is not "storing an object of structure type". This clause covers the case where you do x = y; and x (and y) are themselves of structure type; this makes x's padding bytes unspecified.

Still, I find it curious that padding here is merely unspecified, not indeterminate, values. Essentially (using LLVM lingo), copying something of struct type "freezes" the padding bytes, even if they were indeterminate in the source.

freeze on every load is not enough; we would also need a guarantee that independent loads yield the same value. I think the only way to do that in LLVM is to do a "load; freeze; store".

I'm afraid you've stumbled into one of the festering bugs in the C standard. (Festering in the technical sense — it's been there for a long time, and is still there, despite multiple attempts to correct it.)

Your close-reading of the text of the standard is in fact mostly correct. (You're wrong when you say "load and store do not care"; C11 [N1570] §6.2.7.1p5 specifically calls out "reading" and "writing" trap representations (via a non-character type) as UB.) But, an "indeterminate value" is defined as "an unspecified value or a trap representation" (3.19.2) and reading an unspecified value does not provoke UB (3.19.3). So, per the letter of the standard, accessing an uninitialized value of a type with no trap representations should not provoke UB.

However, the non-normative Annex J contains a stronger statement: "The behavior is undefined in the following circumstances: ... The value of an object with automatic storage duration is used while it is indeterminate." (J.2.1, eleventh bullet point). The behavior of both LLVM and GCC is as if this were a completely normative statement, and I will be very surprised if that isn't also the behavior of MSVC. This applies to objects of all types, whether or not they have trap representations, and indeed even to character types.

So, in the end, yes, "reading any field or padding byte, by any means, provokes UB" in the standard as it is implemented. It is my personal opinion that this is also what the committee intended to specify, they just didn't get it right, except in J.2.

(In terms of LLVM IR, I believe what I am saying is that "poison" values trigger full-fledged UB, and "poison" is what you get when you read an uninitialized value - by any means, of any type including character, heap or stack. I could be wrong about that part; everything I know about LLVM is from reading John Regehr's blog.)

You may be right about this bit. I took the phrase "including in a member object" in "When a value is stored in an object of structure or union type, including in a member object," to refer to initialization of individual members, but it could equally be talking about sub-structures. On the other hand, I don't know what else would make my second memcpy example well-defined and I'm certain that that is meant to be well-defined.

There is no section 6.2.7.1, I assume you mean 6.2.6.1. That is a very interesting paragraph, thanks for the pointer! (It can be really hard to find all the relevant paragraphs for what you are allowed to load from memory; for some reason these are scattered throughout the document...).

Do you have any concrete evidence that LLVM or GCC treat reading indeterminate values at a character type as UB? This contradicts everything I have seen so far in discussions and papers. In particular, this would imply that memcpy either is UB for partially initialized storage (think: memcpy on the entire backing store of a std::vector, including the uninitialized part -- I'm not saying that's a smart implementation, I'm just saying it's a valid one) or cannot be implemented, either of which I would consider a bug in the language. (I would also be interested to learn which of the two you think is the case.)

IIRC poison is generated only for things like integer overflow; uninitialized memory is undef [1]. Also, poison doesn't immediately cause UB, not even when you do arithmetic.

[1] http://lists.llvm.org/pipermail/llvm-dev/2016-December/107886.html says

Since our load today returns undef for uninitialized memory,

The LLVM rules are, however, currently under discussion, in the very thread referenced above [1]. Still, from what I saw, nobody proposed to make loading undef/poison values at character type UB.

I agree that it should be well-defined -- but of course, in my understanding, the justification for this is that memcpy is permitted to copy indeterminate values.

I don't know of a black-box test case that will prove this for a character type. I can black-box demonstrate that both LLVM and GCC ascribe convenient values to uninitialized char, e.g.

int main(void) { unsigned char x; return x != 0; }

will produce machine code equivalent to

int main(void) { return 0; }

but they could still do that if this only constituted use of an unspecified value. More compelling is the observation that both compilers' IRs can explicitly represent "this variable is uninitialized" (and provokes UB if used) but not "this variable has an unspecified value", and if you look at IR dumps you will see "this variable is uninitialized" sentinels used for unsigned char quantities:

*** IR Dump Before SROA ***
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
  %1 = alloca i32, align 4
  %x = alloca i8, align 1
  store i32 0, i32* %1, align 4
  %2 = load i8, i8* %x, align 1
  %3 = zext i8 %2 to i32
  %4 = icmp ne i32 %3, 0
  %5 = zext i1 %4 to i32
  ret i32 %5
}
*** IR Dump Before Early CSE ***
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
  %1 = zext i8 undef to i32
  %2 = icmp ne i32 %1, 0
  %3 = zext i1 %2 to i32
  ret i32 %3
}
*** IR Dump Before Lower 'expect' Intrinsics ***
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
  ret i32 0
}

And further note that this still happens if you force x to be in memory with e.g.

static int foo(unsigned char *p) { return *p; }
int main(void) { unsigned char x; return foo(&x) != 0; }

the inliner reduces it to the original and the same thing happens.

I think memcpy was not intended to be allowed to copy indeterminate values, and was intended to have UB when applied to fully uninitialized storage.

I also think that the "padding takes on unspecified values" language discussed earlier was intended to imply that it doesn't have UB when applied to certain forms of partially uninitialized storage, e.g.

struct S { int x; double d; };
struct S s, t;
s.x = 1;
memcpy(&t, &s, sizeof(struct S)); // intended to be well-defined

but I am no longer certain whether the specification achieves that goal, and anyway it wouldn't appear to apply to anything other than struct and union types, so

int x[10], y[10];
x[0] = 1;
memcpy(y, x, 10 * sizeof(int)); // probably still UB

This is probably turning into another case of the "what exactly does the C standard mean by 'object'?" argument that has never been resolved to anyone's satisfaction, despite, again, multiple attempts to fix the wording.


I think the proper conclusion to be drawn from all of this is that the C standard's concepts of "indeterminate values" and "trap representations" are ill-specified and should not be used as a template for similar concepts in Rust.

I think we should maybe take a few steps back and try to pin down what we want out of mem::uninitialized and similar Rust language features. The main thing I know mem::uninitialized is useful for is result buffers for C library primitives, e.g. nix' signal.rs has

pub fn empty() -> SigSet {
    let mut sigset: libc::sigset_t = unsafe { mem::uninitialized() };
    let _ = unsafe { libc::sigemptyset(&mut sigset as *mut libc::sigset_t) };

    SigSet { sigset: sigset }
}

This is effectively a promise to the compiler that sigset was somehow initialized by the time we get to reading it from safe code, so it does not have to emit code to initialize it. I would really like to find a way to express that promise in a way that the compiler actually understands, but it's a complex thing to express, especially when you start thinking about system primitives like read and recvmsg that can fill in some, but not all, of an uninitialized buffer (and that buffer might be a scatter vector, for extra headaches). Maybe the people doing formal proofs of correctness on unsafe code have some ideas?
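
One way to express a narrower version of that promise, sketched with std::mem::MaybeUninit (a std type that is not part of the code quoted above). fake_read is a hypothetical stand-in for a primitive like read that fills only a prefix of the buffer; it is not a real libc binding, and the scatter-vector case is not covered.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for a primitive like `read`: writes a prefix of
/// `buf` and returns how many bytes it initialized.
fn fake_read(buf: &mut [MaybeUninit<u8>]) -> usize {
    let data = b"hello";
    let n = data.len().min(buf.len());
    // `zip` stops at the shorter side, so exactly `n` slots are written.
    for (slot, byte) in buf.iter_mut().zip(data.iter()) {
        slot.write(*byte);
    }
    n
}

/// Reads into an uninitialized stack buffer and returns only the part
/// that the callee reported as filled.
fn read_prefix() -> Vec<u8> {
    let mut buf = [MaybeUninit::<u8>::uninit(); 16];
    let n = fake_read(&mut buf);
    // The promise is limited to the first `n` bytes; the remaining
    // bytes stay uninitialized and are never read.
    let init: &[u8] = unsafe { std::slice::from_raw_parts(buf.as_ptr().cast::<u8>(), n) };
    init.to_vec()
}
```

The unsafe block is where the promise lives: it is only as trustworthy as the count returned by the callee.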

There's been a lot of talk in this thread about the safety of copying partially uninitialized aggregates. Here I think we need to have a conversation with the LLVM team about what their expectations for that sort of thing are — I did a bunch of tea-leaf-reading above, but I don't actually know. A rule that would make sense to me as someone writing code in Rust would be: copying is always safe, but preserves uninitializedness: e.g. the equivalent of the C above

struct S { i: i32, d: f64 }
let mut s: S = unsafe { S { i: 0, d: mem::uninitialized() } };
let mut t = s;

does not have UB in and of itself, but the field t.d is still uninitialized and a subsequent read will have UB. This is what valgrind does, which indicates that it's been found useful in separating real bugs from chaff.
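
For comparison, a sketch of the same rule where the possibly uninitialized field is spelled out in the type, using std::mem::MaybeUninit. This is an illustration of "copying is fine, reading is not", not a statement about what mem::uninitialized itself guarantees; copy_then_init is a hypothetical name.

```rust
use std::mem::MaybeUninit;

#[derive(Clone, Copy)]
struct S {
    i: i32,
    d: MaybeUninit<f64>, // may be uninitialized
}

/// Copies a partially initialized struct, then initializes the copy.
fn copy_then_init() -> (i32, f64) {
    let s = S { i: 0, d: MaybeUninit::uninit() };
    let mut t = s; // plain copy; t.d is still uninitialized
    let i = t.i; // reading the initialized field is fine
    t.d.write(1.5); // initialize the field in the copy
    // Only after the write is it valid to read t.d as an f64.
    let d = unsafe { t.d.assume_init() };
    (i, d)
}
```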

(I would also be perfectly fine with the existence of a std::mem::byte type such that it was always safe to transmute a datum, initialized or not, to &[byte] and examine the contents. I don't think either i8 or u8 should have that property, though.)
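
A rough approximation of that idea with what std offers: viewing a value as a slice of MaybeUninit<u8>. This is only a partial analogue, since each byte still has to be asserted initialized before its contents can be examined. as_maybe_bytes is a hypothetical helper, and the T: Copy bound is an assumption added here to rule out interior mutability.

```rust
use std::mem::{size_of, MaybeUninit};

/// Views any `Copy` value as its object representation, padding included,
/// without claiming that any individual byte is initialized.
fn as_maybe_bytes<T: Copy>(value: &T) -> &[MaybeUninit<u8>] {
    // `MaybeUninit<u8>` has size 1, alignment 1, and accepts any byte
    // state, so the cast is valid for every byte of `T`.
    unsafe {
        std::slice::from_raw_parts(
            (value as *const T).cast::<MaybeUninit<u8>>(),
            size_of::<T>(),
        )
    }
}
```

For a type without padding, such as u8, calling assume_init on the resulting bytes is fine; for padding bytes it would not be.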

(These proposals might cause difficulties for code generation on ia64, where uninitializedness is sort-of visible in the hardware (the NaT bits on registers); but frankly as far as I'm concerned ia64 can go jump in a lake.)

I can't speak for GCC; but the usual description and the behavior of LLVM's undef pretty much describes "this is an unspecified value".

But anyway it seems we can agree that the C standard is fairly subtle here and there are questions left wrt. what it actually means. Some more discussion of this and related subjects can be found at http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_451 and in Robbert Krebbers' PhD thesis.


If we ignore implementation concerns, I think there's a fairly straightforward answer, and it's what is currently implemented by miri: Bytes (in memory) and values (for a computation) can be "undefined"/"uninitialized" (miri calls this Undef); this is what memory contains after mem::uninitialized() and let x;. Loading Undef and storing it is fine, but any other use (arithmetic, deref, conditional jump, whatever) is UB.
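
A toy model of that rule, to pin down what "load and store are fine, everything else is UB" means. This is not miri's actual data structure or code; Byte, Memory, add and UndefinedBehavior are names invented for the sketch, and UB is modeled as an error value.

```rust
/// Toy abstract byte: either undefined or a concrete value.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Byte {
    Undef,
    Init(u8),
}

/// Marker the toy machine returns where the real rule says "UB".
#[derive(Debug, PartialEq)]
struct UndefinedBehavior;

/// Toy memory: every cell starts out as `Undef`.
struct Memory {
    cells: Vec<Byte>,
}

impl Memory {
    fn new(len: usize) -> Self {
        Memory { cells: vec![Byte::Undef; len] }
    }

    /// Loading never fails, even for `Undef` cells.
    fn load(&self, addr: usize) -> Byte {
        self.cells[addr]
    }

    /// Storing never fails either, so `Undef` can be copied around freely.
    fn store(&mut self, addr: usize, byte: Byte) {
        self.cells[addr] = byte;
    }
}

/// Arithmetic stands in for "any other use": it is an error on `Undef`.
fn add(a: Byte, b: Byte) -> Result<Byte, UndefinedBehavior> {
    match (a, b) {
        (Byte::Init(x), Byte::Init(y)) => Ok(Byte::Init(x.wrapping_add(y))),
        _ => Err(UndefinedBehavior),
    }
}
```

In this model a memcpy over partially initialized memory is just a loop of load and store and never fails; only operations like add notice Undef.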

This seems to match most/all of your expectations.

Of course, the trouble is that we somehow have to properly compile this to LLVM.

I think there is one thing that can be done without much controversy, and that is providing a compiler switch that turns all padding bytes, heap allocations and mem::uninitialized() into explicitly zeroed (or at least random but well-defined) memory. Possibly individual switches for each of the three, if we’re feeling fancy.

Of course, a lot of people would be uncomfortable using it because of the supposed performance hit, but having it present would help out paranoid folks like me (as a hardening option), as well as allow benchmarks to evaluate the actual impact of it.


Padding bytes themselves aren’t a source of uninitialized bytes, so they don’t fit into the same category as heap allocations and mem::uninitialized. The reason padding bytes come up is that they can remain uninitialized even if the struct is otherwise fully initialized. If heap allocs and mem::uninitialized are always zero, uninitialized bytes cannot enter the program (except via FFI), and hence padding bytes are never uninitialized.

For the heap, glibc has mallopt(M_PERTURB, val) or MALLOC_PERTURB_=val in the environment.

What about padding bytes in a value on the stack?

How would a stack value have uninitialized bytes, other than via mem::uninitialized?

let x = (0u8, 0u32);

Aren’t the padding bytes here still uninitialized?

(edit: removed an “incremental” initialization example - I forgot that doesn’t work.)

Ah, you are right. Yes, struct (and enum) literals would also have to make sure to zero-initialize things. I think in LLVM this compiles effectively to let x; x.0 = ...; x.1 = ...;.

They are a source of uninitialized bytes in C:

When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values.

So even a struct which is initially zero-initialized can become uninitialized. Not sure if LLVM ever takes this wording to heart and creates undefs.

(Note that “indeterminate value” is defined as “either an unspecified value or a trap representation”, so aside from trap representations, “unspecified” is just as bad as “indeterminate”. In particular, while reading them theoretically isn’t undefined behavior - which @zackw suggested above is a drafting error, at least in the case of indeterminate values - it’s almost as dangerous. “Unspecified value” is defined as “valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance”, and at least some people interpret “in any instance” as “any time it’s read/used”, i.e. the value doesn’t have to be consistent.)

I am not sure what you base this on - as far as I know, it's the trap representations that make indeterminate values "really bad".

As you said, he was speaking about indeterminate, not unspecified, values.

That article is behind a paywall. :confused: Could you cite the relevant parts?

Anyway we don't depend on C here, we depend on LLVM. I would be hard-pressed to believe that copying an undef-free range of memory at the right type could result in undefs.

Sorry. Actually, I read it through Google cache - here's a link.

Relevant bits:

Looking at DR#451 itself, the proposed response that would specify 'any operation on indeterminate values will have an indeterminate value as a result' would also change the definition of "indeterminate" so this property doesn't apply to merely "unspecified" values. However, there's also a bit that says perhaps padding bytes should act that way anyway:

(emphasis mine)

Anyway, I think LLVM probably doesn't actually generate undefs for this, but I'm not sure :slight_smile:

The issue for heap is not exactly about what's in there. As I explained some time ago, heap allocations are little more than an external function that returns a pointer. The problem here is that Rust memory model itself considers the memory to be UB to read, as a special case. This has no analogy on the system level, it's only there to admit optimizations by the compiler. The switch I'm suggesting wouldn't actually need to involve any writes to the memory. It could very well just disallow the optimizations that depend on the memory being undefined (simply by not considering it to be undefined in the first place).

Not just literals, either. There is no guarantee that an arbitrary assignment uses any sort of memcpy() analog. That could be inefficient for data that have large amounts of padding. It is not unheard of for a structure to take up several times the size of its fields in memory, because of alignment requirements and whatnot.

C is not exactly famed for running a tight ship when memory is concerned. In terms of LLVM, all you need to do is claim to it that those padding bytes are real data.

malloc is special in a bunch of ways, e.g. it comes with an annotation saying that the resulting pointer does not alias any previously existing pointer. ptr::offset is UB if the pointer arithmetic leaves the "allocated object", so a definition of this behavior has to know about how memory is allocated in blocks. None of this has an "analogy on the systems level". The same applies for other sources of UB in C, like integer overflows.

This is not an accident, it is a feature. It turns out that programs are much easier to analyze, and hence much easier to optimize, if you have a more abstract model of the machine. Languages like Haskell take this very far. This results in a conundrum for C: On the one hand, people want code to be fast and optimized in smart ways -- but on the other hand, these optimizations are actually wrong if you have the power to observe the machine "on the systems level". Much of the complexity of the C standard arises from trying to reconcile these two worlds, and they are actually doing a remarkably good job.

One consequence of all of that is that "this abstract model is not actually how the machine works" is not an argument against using it as a model. That said, there should usually be a justification for deviating from "actual systems behavior". The justification for doing this in the case of malloc has already been brought up -- the compiler should be free to transform heap to stack allocations to registers and vice versa. Not having three different kinds of storage simplifies the model. Treating uninitialized data specially for registers is immediately useful: it means the compiler can just insert any random register as the source of uninitialized data, with no guarantees about the data being stable or well-formed for any given type. Consequently, all storage has a form of uninitialized data that is different from "unknown bit patterns".


That's all well and good, but I fail to see how it pertains to the discussion at hand. I'm not debating the usefulness of certain methods and constructs having more assumptions attached to them.

One consequence of all of that is that “this abstract model is not actually how the machine works” is not an argument against using it as a model.

It's not intended as an argument against the model, it's an explanation aimed to show that some aspects of the model can be relaxed easily and without breaking the universe or touching many things (even if certain optimizations become inapplicable). I merely proposed that a compiler switch can be added that removes certain assumptions for the purpose of hardening against UB. The point of all I've written is that eliminating all sources of undefined memory is much easier than it sounds.