Pointers Are Complicated II, or: We need better language specs

Is this a question about Rust? Because that particular problem is already solved by the Rust aliasing model / Stacked Borrows. I don't really know what constitutes a top level object to you but that memory manager would have some issues with stack allocated objects, which are not uncommon in Rust programs. (Also, your example isn't in Rust or C/C++, is that OCaml?)

Zero-copy parsers do it all the time, but they're usually treating the source as immutable, so that's probably not super-relevant.

As for treating "top level" objects specially, and allowing different aliasing rules for them, Rust already has that, too. It's called UnsafeCell, and its wrappers, which include Cell, RefCell, Mutex, and RwLock. You can't technically accomplish your example using them, but they generally do confer a lot of flexibility, and the rules governing them seem close to the ones you gave. Namely, two Cells can alias, but partially aliasing them could cause pointer invalidation issues, so it isn't necessarily safe.

Is this a question about Rust? Because that particular problem is already solved by the Rust aliasing model / Stacked Borrows. I don't really know what constitutes a top level object to you but that memory manager would have some issues with stack allocated objects, which are not uncommon in Rust programs. (Also, your example isn't in Rust or C/C++, is that OCaml?)

I was curious about Rust, but I don't know that language. For my example, I was trying to avoid disturbing the moderator by mentioning a certain letter-named language, so I used Pascal. I'd originally had two code snippets, and had specified that the first was Pascal, but then I decided the first example was superfluous and snipped it along with the note specifying the language.

When I use the term "top-level object", the basic idea would be that all references that could observably identify the same storage would always be bound to a single top-level object. If P and Q are top-level pointers, and p and q are references individually derived from them via any means, the only way p and q could be observably associated with the same storage would be if P and Q were equal. Directly-declared objects would be top-level objects, as would objects allocated on the heap. If foo is a directly-declared integer of static duration, and boz is an instance of a structure type with an integer member named bar, then foo would be a top-level integer object whose address could be persisted, and to which a stacked-borrow reference could be formed, while boz.bar would be a non-top-level integer object whose address could not be persisted, but to which a stacked-borrow reference could generally be formed; borrows against boz.bar would need to be reconciled with boz.

If a function has borrowed references to that integer and that structure, neither would--for purposes of the function--be a top-level object. The function could stacked-borrow a reference to either the structure, or to its members; as with the top-level structure object, a borrow against a member would be reconciled with the parent.
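
To try to make that concrete in Rust-like terms (please correct anything I'm getting wrong about the language; foo, boz, and bar are the names from above, and the extra member other is added only for illustration), the distinction I have in mind is roughly:

```rust
// Rough sketch: a borrow against a member is reconciled with its parent,
// while disjoint members can be borrowed independently.
struct Boz {
    bar: i32,
    other: i32,
}

static FOO: i32 = 0; // a directly-declared, static-duration integer

fn main() {
    let mut boz = Boz { bar: 1, other: 2 };

    let bar_ref = &mut boz.bar;     // borrow of a member...
    let other_ref = &mut boz.other; // ...can coexist with a borrow of a sibling member
    // let whole = &mut boz;        // ...but a borrow of the whole of boz would conflict
    *bar_ref += 1;
    *other_ref += 1;

    let persisted: &'static i32 = &FOO; // FOO's address can be persisted indefinitely
    assert_eq!(boz.bar, 2);
    assert_eq!(*persisted, 0);
}
```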

How often would high-performance-computing tasks need semantics that wouldn't fit such a model?

Not to put too fine a point on it, but if you aren't familiar with Rust then you might have some difficulty proposing an improvement on the aliasing model that Rust uses, which is the result of several years of careful thought (by many people but by Ralf Jung, the OP, especially, since he wrote the book on it). Furthermore, in order to understand what the aliasing model needs to be able to do, you have to know what Rust code is able to do, and that again requires some familiarity with the language. It looks superficially like C but the borrow checker is not something you have ever dealt with in another language, so there isn't really any substitute for this.

8 Likes

This is specifically a forum "for discussing everything related to the implementation and design of the Rust programming language." Yes, the original article is not specific to Rust, but it's expected that the discussion on this forum would take a Rust-centric outlook on it. If you want to ignore Rust, then this shouldn't be the forum where you're discussing it; Reddit and Hacker News both also had threads on it.

Based on a rough understanding of what you're saying -- you're basically describing how Rust normally works. Safe Rust has NO UB by following a stacked model of borrowing, which was formalized and proved correct by Ralf here. Rust is built on the premise that the answer to this question is "rarely," so long as the right building blocks are provided.

Still, we offer unsafe because it's necessary to implement certain things, typically data structures, which can then be used from the outside safely.

Please, I suggest familiarizing yourself with Rust before accidentally making a fool of yourself on a Rust-centric forum. Everyone else here is intimately familiar with Rust (at least the safe, nonformal parts).

4 Likes

Hmm, in that light it might be suboptimal that Ralfʼs blog entry, at the bottom of the page, specifically links to this forum post for leaving comments (without noting any expectation that the discussion focus on Rust in particular).

2 Likes

My intended point is that if there are many high-performance computing tasks that don't require generalized pointer arithmetic and pointer-to-integer conversions, they could be better handled by a language that simply doesn't include such concepts than by a language that includes syntactic constructs to perform such operations but then imposes all sorts of semantic restrictions on their use. I don't know if Rust would be suitable for use as such a language, but since this is a Rust forum I thought it might make sense to try to define things in Rust terms to the extent I understand them; among other things, people here might be able to set straight any misunderstandings I have about the language.

If a language supports arbitrary conversions between integers and pointers, it will be impossible to characterize all actions as having behavior which is definitely defined or definitely not defined unless one defines so many things as to preclude many sometimes-useful optimizations; even trying to come close to characterizing all actions as definitely defined or definitely not defined would require massively complicating the language and compilers for it, in order to ensure that all corner cases are handled correctly.

Further, many useful sorts of diagnostics like array bounds checking require that an implementation be able to find information about an object given a reference to it. If a language would allow a pointer to the second element of an eleven-element array to be passed to a function that expects to manipulate a ten-element array, such a function would have no way of determining what range of indices it should be able to use.

If the tasks which are most performance sensitive can get by with a very limited range of pointer operations, and those which need to have more flexibility with pointers aren't as performance sensitive, using separate languages or identifiably-different dialects may allow both needs to be met more effectively and more easily than would be possible using a "universal" dialect.

Based on a rough understanding of what you're saying -- you're basically describing how Rust normally works. Safe Rust has NO UB by following a stacked model of borrowing, which was formalized and proved correct by Ralf here. Rust is built on the premise that the answer to this question is "rarely," so long as the right building blocks are provided.

Building something like a mutable linked list would require access patterns that don't particularly fit the stacked-borrow model, since list nodes are mutable, and the creation and use of pointers to them wouldn't fit the LIFO pattern required for stacked borrows, but I wouldn't think it should require "unsafe" code. After all, if one had a large array of ListNode, and links were represented by indices instead of pointers, then the fact that multiple integers used as indices might hold the same value, and thus identify the same node, wouldn't pose any sort of problem. A compiler could easily know that accesses to AllListNodes[i] and AllListNodes[j] could interact if and only if i and j could be equal.

From an implementation perspective, an aliasable top-level-object pointer could be viewed as an opaque type which happens to be stored as an integer, but the only operations that could be performed on it would be assignment, comparison, and use to borrow an element of AllListNodes, along with ways of "creating" and "deleting" objects [returning the index of an unused element, or adding an index to the pool of available indices].
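
To sketch what I mean in Rust-like terms (all names made up for illustration): the links would be indices into one backing array, and the handle type would support nothing beyond copying, comparing, and looking up a node.

```rust
// Rough sketch: links are indices into one backing Vec, so two equal handles
// trivially identify the same node, and no raw pointers are needed.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NodeId(usize); // opaque handle that happens to be stored as an integer

struct ListNode {
    value: i32,
    next: Option<NodeId>,
}

struct Arena {
    nodes: Vec<ListNode>,
    free: Vec<NodeId>, // pool of available indices
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new(), free: Vec::new() }
    }

    // "Creating" an object returns the index of an unused element.
    fn create(&mut self, value: i32, next: Option<NodeId>) -> NodeId {
        match self.free.pop() {
            Some(id) => {
                self.nodes[id.0] = ListNode { value, next };
                id
            }
            None => {
                self.nodes.push(ListNode { value, next });
                NodeId(self.nodes.len() - 1)
            }
        }
    }

    // "Deleting" an object just returns its index to the pool.
    fn delete(&mut self, id: NodeId) {
        self.free.push(id);
    }

    fn get_mut(&mut self, id: NodeId) -> &mut ListNode {
        &mut self.nodes[id.0] // an ordinary, bounds-checked borrow of one element
    }
}

fn main() {
    let mut arena = Arena::new();
    let tail = arena.create(2, None);
    let head = arena.create(1, Some(tail));
    arena.get_mut(tail).value = 20; // mutation through a handle, no unsafe needed
    assert_eq!(arena.get_mut(head).next, Some(tail));
    arena.delete(tail);
}
```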

Note that even if it would be impractical for a compiler to identify all circumstances where such a "pointer" to an object could outlive the object itself, restricting pointer manipulations in this fashion would make it reasonably practical for a diagnostic implementation (or even a production one in cases that aren't too speed-critical) to guarantee that all improper usage scenarios would be trapped, by preceding each top-level object with a header containing bookkeeping information which could be updated when borrows are taken against it.

Please, I suggest familiarizing yourself with Rust before accidentally making a fool of yourself on a Rust-centric forum. Everyone else here is intimately familiar with Rust (at least the safe, nonformal parts).

My life philosophy is that one learns more from being wrong than being right. If there's something I'm misunderstanding, I'd rather be set right than stay quiet and continue in my misunderstanding for fear of embarrassing myself.

Is there an issue for that example?

1 Like

It doesn't require unsafe code. It just requires the use of "interior mutability," which is accounted for in stacked borrows. (As an implementation detail of the standard library, interior mutability uses unsafe, but only as a matter of convenience for the compiler; it could equally well be provided as a primitive, and stacked borrows treats it as such.)
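
For instance (a minimal illustration, not how one would actually build a linked list), two shared references to the same Cell may coexist, and either can mutate through it:

```rust
use std::cell::Cell;

// Interior mutability in a nutshell: mutation through shared references,
// without any `unsafe` in user code.
fn main() {
    let c = Cell::new(1);
    let r1 = &c;
    let r2 = &c; // aliasing shared references are fine
    r1.set(2);   // ...and mutation through one of them is still allowed
    assert_eq!(r2.get(), 2);
}
```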

That's fine, but perhaps you should be clear about which things you are certain of and which you are intentionally being wrong about for learning purposes. This is a normal mode of communication called "asking questions." :wink:

1 Like

Moderator note: The purpose of this thread is to plan the future of Rust, not to teach the language to beginners. If you want help learning the language, please open a thread on https://users.rust-lang.org/.

A good way to get started is to read The Book, and, if anything is unclear, ask about it in the users forum. Then, once you've learned the basics, The Rustonomicon teaches about the details of the memory model and other unsafe features.

7 Likes

It's not really a problem, since I had to use an unsafe block, but there is some discussion about exactly when it is and isn't okay to overlap Cells in this issue:

1 Like

Rust does this. It helps to think of it as two coexisting languages: "safe Rust" does not have pointer-to-int conversions (or more accurately, you can turn a pointer into an int and vice versa but the values you get from this can't be used for dereferencing), and "unsafe Rust" is much more similar to C with lots of narrow-contract operations that can cause UB and a flat memory model (in fact, it's even more permissive than C when it comes to things like type punning). The central thesis of Rust is that most code can be written in the safe part, where we have strong theorems relating to memory safety, and the unsafe part requires manual auditing, but can be used to build safe interfaces so that upstream code does not have to worry about this.
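
To illustrate the dividing line with a trivial sketch: the conversions themselves are safe operations, but actually reading memory through a pointer that came from an integer requires stepping into unsafe Rust.

```rust
fn main() {
    let x = 42i32;
    let addr = &x as *const i32 as usize; // pointer-to-int: safe
    let p = addr as *const i32;           // int-to-pointer: also safe to *create*
    // ...but dereferencing it is only possible in the unsafe part of the language:
    let y = unsafe { *p };
    assert_eq!(y, 42);
}
```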

The reason Ralf is thinking about pointer provenance and such is because even though safe Rust does not have integer pointer conversions, unsafe Rust does, and the operations you can do aren't much different from C. So we're left with the task of determining when exactly these sorts of operations are defined, which means dealing with pointer provenance if we want to lower to LLVM (which has pointer provenance). Personally, I think it's okay to be fairly lax on these matters and just say that inttoptr(n) is a wildcard pointer that can access anything any other pointer could access, because unsafe code can also use references to get a much stronger aliasing guarantee. This brings unsafe Rust closer to the "high level assembly" kind of C where more things are defined and less optimizations are possible.

Rust solves this using slices, &[T], which are a pointer-length pair. I encourage you to read up on them. There is no need for pointer arithmetic in order to handle these use cases.
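
For example (a small sketch): passing elements 1..11 of an eleven-element array hands the callee a slice of length 10, so the callee knows exactly which range of indices it may use.

```rust
// The length travels with the pointer, and indexing is checked against it.
fn sum(window: &[i32]) -> i32 {
    window.iter().sum()
}

fn main() {
    let eleven = [1i32; 11];
    let total = sum(&eleven[1..11]); // a ten-element view into an eleven-element array
    assert_eq!(total, 10);
}
```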

Rust does this. It's called Miri, and Ralf is also the author and principal maintainer of that. Its main purpose is to be an interpreter for Rust code that can catch all UB, even in unsafe code, by literally representing the borrow stack and other abstract-machine-only concepts as data structures. It's not easy to do the same thing for C because there are many instances of UB that require the abstract machine to execute all possible nondeterministic paths on some operations, and I wouldn't be surprised if there are some cases of UB that are formally undecidable.

6 Likes

The reason Ralf is thinking about pointer provenance and such is because even though safe Rust does not have integer pointer conversions, unsafe Rust does, and the operations you can do aren't much different from C. So we're left with the task of determining when exactly these sorts of operations are defined, which means dealing with pointer provenance if we want to lower to LLVM (which has pointer provenance). Personally, I think it's okay to be fairly lax on these matters and just say that inttoptr(n) is a wildcard pointer that can access anything any other pointer could access, because unsafe code can also use references to get a much stronger aliasing guarantee. This brings unsafe Rust closer to the "high level assembly" kind of C where more things are defined and less optimizations are possible.

For how many purposes would performance be unacceptably reduced if pointer provenance were tracked through "unsafe Rust" parts of the code only in the presence of directives saying that, even within the unsafe code, the pointers would be used in very limited ways?

The most common place I've seen where programs use pointer-to-integer conversions is for a construct like: alignedptr = (void*)((((uintptr_t)ptr-1) | 15)+1), but such a conversion could be avoided if the code were written as: alignedptr = (void*)((char*)ptr + (15 & -(unsigned char)ptr));. Writing the code that way would make clear to any compiler that cared about such things that not only would alignedptr be clearly derived from ptr, but it would also make it easy for a compiler to see that the representation of ptr wasn't leaked.
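
In Rust terms, to the extent I understand them, the difference might look something like the following sketch (names are made up), where the adjusted pointer stays visibly derived from the original rather than being reconstituted from an integer:

```rust
// Sketch: round a pointer up to a 16-byte boundary by offsetting it, so its
// provenance is never laundered through an integer-to-pointer conversion.
fn align_up_16(ptr: *const u8) -> *const u8 {
    let offset = (ptr as usize).wrapping_neg() & 15; // bytes needed to reach alignment
    ptr.wrapping_add(offset) // the result is derived from `ptr` by offsetting
    // (the standard library's `align_offset` method expresses a similar intent)
}

fn main() {
    let buf = [0u8; 64];
    let aligned = align_up_16(buf.as_ptr());
    assert_eq!(aligned as usize % 16, 0);
}
```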

Almost any quality language implementation will need to document a set of requirements for the environment in which it runs. If code written in languages other than Safe Rust is viewed as being part of the environment in which Safe Rust code runs, and a Safe Rust implementation specifies how the environment may interact with pointers that are exposed to it, that would make it easier to specify the behavior of the Safe Rust abstract machine. Optimizations would be limited in the vicinity of actions which cross the boundary between Safe Rust and other languages (such as C or Unsafe Rust), but opportunities for optimization within Safe Rust code would be enhanced.

Having clearly demarcated sections of a program, and specifying the required relationship between abstract and physical machine states at the boundaries, would avoid the need to have one abstraction model which simultaneously gives programmers the freedom to perform any kind of low-level tasks they may need to do, while allowing compilers to safely make the kinds of assumptions necessary for effective optimization. Although having one abstract machine for everything might seem more "elegant", trying to build an underwater exploration vehicle that can fly to a dive site would be far less cost-effective than building a submarine along with a helicopter that can transport it, and a means of getting people between the chopper and the submarine when needed.

Rust solves this using slices, &[T], which are a pointer-length pair. I encourage you to read up on them. There is no need for pointer arithmetic in order to handle these use cases.

Some languages and frameworks represent slices with [basePointer, startIndex, length] triples, rather than using a pointer to play the combined role of basePointer and startIndex. The combined approach may have allowed efficient code to be produced even with simple compilers, but its performance benefits have diminished, while the advantages of handling the parts separately have increased.

Rust does this. It's called Miri, and Ralf is also the author and principal maintainer of that. Its main purpose is to be an interpreter for Rust code that can catch all UB, even in unsafe code, by literally representing the borrow stack and other abstract-machine-only concepts as data structures. It's not easy to do the same thing for C because there are many instances of UB that require the abstract machine to execute all possible nondeterministic paths on some operations, and I wouldn't be surprised if there are some cases of UB that are formally undecidable.

In a language which uses (basePointer,startIndex,length) triples to represent slices, things like array bounds checking can be performed relatively inexpensively, and an optimizer that understands such things may be able to hoist bounds checks out of a loop if language semantics permit (or at least identify cases where an index can be guaranteed not to go out of bounds, and run a fast version of the loop when that is true, and a slower version for cases where a loop might exit early but the index will go out of bounds if it doesn't). The performance costs for guaranteeing that out of bounds array accesses will be trapped used to be large, but have gone down to the point that for many purposes the assurance of safely-trapped behavior would be worth the cost. In fact, if one is willing to accept relatively loose semantics with regard to precisely when a trap occurs, automatic array-bounds checking can sometimes be cheaper than manual checks.
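
As a small illustration of the check-once-then-run-fast idea (written here with Rust's two-field slices, since that's what's at hand; whether the per-iteration checks actually disappear is up to the optimizer):

```rust
// One explicit check up front; inside the loop every index is provably in
// bounds of `window`, which gives the optimizer license to drop the
// per-iteration checks (or lets a human reason that they can never fail).
fn sum_first(data: &[u32], n: usize) -> u32 {
    let window = &data[..n]; // traps (panics) here if n > data.len()
    let mut total = 0;
    for i in 0..window.len() {
        total += window[i];
    }
    total
}

fn main() {
    let v: Vec<u32> = (0..100).collect();
    assert_eq!(sum_first(&v, 10), 45);
}
```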

As for reliably detecting Undefined Behavior in C, that would be fundamentally intractable even if one were to fix all of the ambiguities in the Standard about what is and isn't UB, and add constructs to accomplish all of the things that are presently possible only by exploiting the fact that some implementations define behavior in more cases than mandated by the Standard. In fact, the problem will be intractable in any language which allows arbitrary conversions between pointers and integers, which is why CompCert C disallows such things. If a program computes the integer representations of a bunch of pointers, passes them to an opaque function that returns an integer, and then does something with the resulting pointer, determining whether the program invokes UB would require levels of analysis that would be intractable even if the program accepts no inputs, and downright impossible if the program receives inputs that can depend upon previous outputs [since determining whether the program could produce UB in any situation which could actually arise would require knowledge of the external process that converts outputs into fed-back inputs].

Such as? What are the benefits of using the fatter representation? Note that it is not valid to go from a (ptr, len) to a (ptr-2, len+2) pair.

And so they can with just (ptr, len) as well; in fact more cheaply. In the (ptr, len) case it's just idx < len, whereas in the (ptr, offset, len) case it's idx + offset < len.

(Safe) Rust already bounds checks every array access (unless you explicitly opt out with unsafe). Please, familiarize yourself with Rust before suggesting we do things we already do.

Detect != Prove. It definitely is possible to detect UB in a specific execution path, because you can know what external stuff gives by just calling it. Miri doesn't prove your code cannot execute UB, it just executes a single execution path through your program and tells you if it executed any UB. That's as easily decidable as running the program in the first place.

4 Likes

As described in the nomicon:

Of course, a full aliasing model for Rust must also take into consideration things like function calls (which may mutate things we don't see), raw pointers (which have no aliasing requirements on their own), and UnsafeCell (which lets the referent of an & be mutated).

The important part is that Rust has more than one kind of pointer. References have strict aliasing requirements (but not Strict Aliasing Requirements in the C sense, because type punning is allowed even for them), while UnsafeCell and raw pointers do not.

In other words, everyone else here already knows that there's value in having unsafe code be subject to less strict aliasing guarantees than safe code is. That's why raw pointers are defined as having "no aliasing requirements" in the Nomicon. The difficulty is in defining the correct restrictions on converting between raw pointers and safe references. Such conversions have to happen in order for Safe Rust and Unsafe Rust to interoperate, and the purpose of having a single abstract machine for both of them is to use it to define what those conversions do.

The stacked borrows texts define what the expected behavior is when converting between these four (two safe pointer types & and &mut, and two unsafe ones &UnsafeCell and *mut / *const) pointer types. To quote it specifically:

Just like mutable references, raw pointers get added to the borrow stack. The idea is that an XYXY pattern is allowed when both X and Y are raw pointers, but if either one of them is a mutable reference, we want that to still be a violation of the stack principle. This way, when performing analyses for compiler optimizations, we can be sure that mutable references are never part of an XYXY pattern of memory accesses.

In other words, raw pointers don't have strict aliasing requirements unless they are being aliased with safe mutable references, because otherwise safe mutable references couldn't be optimized.
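
As a small sketch of that rule: alternating accesses through two raw pointers to the same storage is fine, while (per the quoted rule) the same alternation would be a violation if either of them were a mutable reference.

```rust
fn main() {
    let mut x = 0i32;
    let p: *mut i32 = &mut x; // raw pointer to x
    let q: *mut i32 = p;      // a second raw pointer to the same storage
    unsafe {
        *p = 1; // X
        *q = 2; // Y
        *p = 3; // X again: allowed, raw pointers carry no aliasing promise
        *q = 4; // Y again
    }
    assert_eq!(x, 4);
}
```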

2 Likes

To me, the fundamental question, the reason inttoptr is hard, is figuring out why the code has permission to access the data. If you have a slice of bytes and want to build a pointer with access to a suffix of that slice, you can do that with &p[n..] in safe Rust. So for your example, if ptr: &[u8] we could write roughly let alignedptr: &[u8] = &ptr[(ptr.as_ptr() as usize).wrapping_neg() & 15..], and this is now entirely safe code which produces a slice whose pointer is 16-byte aligned (although this is pretty weird code to need to write; generally you would just use an aligned type).

This is false, because this is exactly the main new thing that Miri is able to check. It's not just checking for "signed overflow" or equivalent local UB (that's not UB in Rust but you get the point); it's checking for things like: when you transmute an integer to a pointer and dereference it, you might cause UB if you read where you aren't supposed to, and Miri is able to reliably tell whether or not you have caused UB because it knows where all the allocations are, and, more to the point, which allocations (and which individual bytes in those allocations) have been marked as safe to be read from pointers like the one you are using to access the data. This requires a lot of bookkeeping so it's quite slow, but it's just O(1) overhead on regular execution, not exponential time, so I wouldn't call it "intractable".
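
For concreteness, the kind of program this catches looks like the following sketch: it compiles, and may even appear to work when run natively, but Miri reports the out-of-bounds read because it knows the extent of the allocation.

```rust
fn main() {
    let a = [1u8, 2, 3];
    let p = a.as_ptr();
    // Reads one byte past the end of `a`: UB, flagged when run under Miri.
    let oob = unsafe { p.add(3).read() };
    println!("{}", oob);
}
```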

The actual problems with a Miri for C have more to do with nondeterminism. C has lots of unsequenced points in it, a sort of local parallelism in the semantics for no particularly good reason (it almost never translates to actual parallelism but it licenses some function reorderings by the compiler). To properly catch UB in this situation, you have to analyze all interleavings (and even the non-interleaving versions, where causality is a partial order, not a total order). This is worst case exponential, because UB can be caused in any of those interleavings.

Note that Miri is a dynamic UB checker. It does not statically analyze the program for UB, which is actually undecidable for all but the most trivial languages (which are either not turing complete or have no UB). There are tools like ubsan that are dynamic UB checkers for C but I don't think there is any claim of completeness in these tools.

2 Likes

Such as? What are the benefits of using the fatter representation? Note that it is not valid to go from a (ptr, len) to a (ptr-2, len+2) pair.

Any code that works with array segments on any JVM or .NET-based language without using unsafe code will need to keep track of the starting index separately from the array reference. Among other things, this makes it easy to keep array references attached to information such as the array size, element type, a lock-identity token, etc.

Note also that, with this design, both the JVM and .NET offer memory-safety guarantees which cannot be broken even by multi-threaded code that lacks the barriers necessary to ensure correctness. If an object contains a memory reference which is written in one thread while it is being read in another, the read is guaranteed to yield either a valid copy of the old reference or a valid copy of the new one, and is guaranteed not to yield a reference to any storage which could not be accessed through either reference. If one attempts to copy an (arrayRef, index, length) triple while it's being modified, the items in the triple may have no meaningful association with each other, but memory safety wouldn't be dependent upon the offset and length fields of the triple. If x is a ten-element array and code tries to copy a (firstElementPtr, length) pair when its value is being changed between (x+1,9) and (x+2,8) or vice versa, it may be hard for an implementation to guarantee that any attempts to write storage beyond the end of x will get trapped.

Further, if an implementation wants to produce different versions of a loop for scenarios where things may or may not alias, being able to treat the ability of array references to alias as an equivalence relation can greatly simplify that. When accessing arrays directly, the only way a[i] and b[j] can identify the same storage is if both a==b and i==j. When accessing them through array-segment wrappers, other forms of aliasing become possible, but a compiler would be able to check whether a.backingArray == b.backingArray (made-up names for a hypothetical user-code wrapper) and whether, for any i and j of interest, a.startIndex+i could equal b.startIndex+j.
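
A rough sketch of such a wrapper in Rust-like terms (all names made up, echoing the hypothetical backingArray/startIndex fields above), including the kind of overlap test that this equivalence relation makes cheap:

```rust
use std::rc::Rc;

// A segment keeps the backing array, start index, and length as separate fields.
#[derive(Clone)]
struct Segment {
    backing_array: Rc<Vec<i32>>,
    start_index: usize,
    length: usize,
}

impl Segment {
    fn get(&self, i: usize) -> i32 {
        assert!(i < self.length); // check against the segment, not the whole array
        self.backing_array[self.start_index + i]
    }

    // Two segments can only alias if they share a backing array, and then only
    // where their index ranges intersect.
    fn may_overlap(&self, other: &Segment) -> bool {
        Rc::ptr_eq(&self.backing_array, &other.backing_array)
            && self.start_index < other.start_index + other.length
            && other.start_index < self.start_index + self.length
    }
}

fn main() {
    let data = Rc::new((0..20).collect::<Vec<i32>>());
    let a = Segment { backing_array: data.clone(), start_index: 2, length: 10 };
    let b = Segment { backing_array: data.clone(), start_index: 8, length: 5 };
    assert_eq!(a.get(0), 2);
    assert!(a.may_overlap(&b));
}
```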

And so they can with just (ptr, len) as well; in fact more cheaply. In the (ptr, len) case it's just idx < len, whereas in the (ptr, offset, len) case it's idx + offset < len.

Array-segment wrappers can use either (ptr, start, end) or (ptr, start, length). I personally prefer the latter, but the former design seems more common. The cost of the arithmetic, however, is trivial on modern high-end CPUs where performance will be dominated by memory access.

If one doesn't care about the kinds of memory safety assurance that the Java/.NET approach facilitates, or the aliasing-analysis advantages it can provide, then the (elementPointer+index,length) approach can offer somewhat faster code. Its performance advantages, however, are nowhere near as great as they once were, and the threat level from deliberately-malformed data has massively escalated in the last few decades.

Detect != Prove. It definitely is possible to detect UB in a specific execution path, because you can know what external stuff gives by just calling it. Miri doesn't prove your code cannot execute UB, it just executes a single execution path through your program and tells you if it executed any UB. That's as easily decidable as running the program in the first place.

If a program performs an integer-to-pointer cast on a number that just so happens to match the address of something useful, and accesses the object in question, such an action would have defined behavior if and only if that was the only possible location for the object that would be consistent with already-observed behaviors. If, e.g., a program computes uintptr_t upn = (uintptr_t)&n and then outputs upn & ~0x100, then it would be possible for a program to determine that n must necessarily be in one of two places. If it had no way of knowing which of those places was the actual address of n, nor what was at the other, casting one of them to a pointer and accessing it would invoke UB, even if the most likely consequence would be for the implementation to behave as though the program somehow knew the address of n. If, however, the program knew on the basis of some other observations that upn matched ((uintptr_t)&m) & ~0x100, then it would be entitled to know that converting upn to a pointer and dereferencing it would access either m or n, even if it had no way of knowing which.

Trying to keep track of everything that a program might know about addresses is essentially impossible, but if an implementation doesn't do so, it will be impossible to 100% reliably distinguish programs that work entirely as a consequence of defined behaviors from those that work only as a result of luck.

Do you have confidence, however, that this way of learning is not happening at the expense of others, and is not overstepping the bounds of their goodwill?

4 Likes