Pre-RFC: raw pointer cleanup


#1

Hey all. I’ve had this on the back-burner for a while. Since I’m disappearing, I figured I’d just drop this here for people to discuss, and maybe for someone else to pick up.


Summary

ptr::Unique -> ptr::Owned

add ptr::Shared

add many ptr fns as methods on raw pointers

Motivation

Raw pointers are… annoying… to work with in Rust. Not necessarily for good reasons. The best thing about *const and *mut is that they exactly mirror C pointers, which I’m told is really important for FFI.

However, there are other reasons to use raw pointers: modelling stuff that is indirected, but where references are inappropriate. Several peculiarities of raw pointers make them incredibly frustrating for this use case.

*mut T is invariant over T

Even though *mut is the natural pointer to use to model an owned value, its invariance interferes with this. Most owning things want to be able to mutate the data they own (in which case *const doesn’t work), but are sound to be variant over the data they own because mutating operations only happen through an &mut self, which enforces invariance as appropriate.

For the most part, I believe this to just be a papercut, though. I don’t think many people would run into much trouble if their custom container/pointer types were needlessly invariant. Also since you can get an &mut T out of an &*mut T, this is technically the right default. You definitely don’t want to be incorrectly variant.

However invariance is a “no going back” thing. Although you can opt out of variance, you can’t opt out of invariance once one of your type parameters establishes this (variance computation is like a boolean that gets AND’d together).

To work around this, you need to store a *const T and cast it to a *mut T whenever you want to do mutations.
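A minimal sketch of that workaround, assuming a hypothetical owning type called MyBox (the name and its methods are illustrative only, not anything in std). The pointer is stored as *const T and cast to *mut T only inside an &mut self method:

```rust
// Hypothetical owning pointer type; `MyBox` is an illustrative name only.
pub struct MyBox<T> {
    // Stored as *const T so that MyBox<T> stays covariant in T.
    ptr: *const T,
}

impl<T> MyBox<T> {
    pub fn new(value: T) -> Self {
        // Reuse Box's allocation and keep only the raw pointer.
        MyBox { ptr: Box::into_raw(Box::new(value)) as *const T }
    }

    pub fn get(&self) -> &T {
        unsafe { &*self.ptr }
    }

    pub fn set(&mut self, value: T) {
        // Cast back to *mut T only at the point of mutation. The &mut self
        // receiver is what enforces invariance where it matters.
        unsafe { *(self.ptr as *mut T) = value; }
    }
}

impl<T> Drop for MyBox<T> {
    fn drop(&mut self) {
        // Hand the allocation back to Box so it is freed normally.
        unsafe { drop(Box::from_raw(self.ptr as *mut T)); }
    }
}

// Compiles only because MyBox<T> is covariant in T: a MyBox<&'static str>
// is accepted where a MyBox<&'a str> is expected.
pub fn shorten<'a>(b: MyBox<&'static str>) -> MyBox<&'a str> {
    b
}
```

Changing the field to *mut T should make shorten fail to compile, which is the papercut in question.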

Raw pointers do not claim to own their referents

This is a problem for dropcheck. If you contain a *mut T or *const T which points to data that you own (read: that you will drop), dropcheck needs to know that to do its job. If you do own your referent and don’t claim it, it’s possible to create unsound Drop impls.

This situation is the polar opposite of the variance issue: rather than being a restrictive but safe default, it’s a liberal and unsafe default. However this behaviour is motivated by the desire to not be an inescapable trap like the invariance of *mut T is. The only tool we provide for specifying ownership of T is PhantomData<T>. Like invariance, once you’ve established ownership of a type, there’s no way to undo this. There’s no way to use PhantomData<!T> or something to specify that although you appear to own a T, you in fact don’t.

Since it’s plausible to use raw pointers in a non-owning context, this is the most general default to take. Unlike the variance problem, there wouldn’t be a usable way to opt out of this problem like using *const T. We would have had to introduce two new pointer types at the lang level to enable opting out of this.
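For concreteness, here is roughly what claiming ownership with PhantomData<T> looks like. RawOwner and its field names are hypothetical, chosen only for this sketch:

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Hypothetical container; `RawOwner` is an illustrative name only.
pub struct RawOwner<T> {
    ptr: *const T,
    // Zero-sized marker: tells the compiler to treat this struct as if it
    // stored a T by value, which is what dropcheck needs to know.
    _owns: PhantomData<T>,
}

impl<T> RawOwner<T> {
    pub fn new(value: T) -> Self {
        RawOwner {
            ptr: Box::into_raw(Box::new(value)) as *const T,
            _owns: PhantomData,
        }
    }

    pub fn get(&self) -> &T {
        unsafe { &*self.ptr }
    }
}

impl<T> Drop for RawOwner<T> {
    fn drop(&mut self) {
        // We really do drop the T, so the ownership claim above is truthful.
        unsafe { drop(Box::from_raw(self.ptr as *mut T)); }
    }
}

// The marker costs nothing at runtime: the struct is still pointer-sized.
pub fn marker_is_free() -> bool {
    size_of::<RawOwner<u64>>() == size_of::<*const u64>()
}
```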

Raw pointers are nullable

C pointers are nullable, raw pointers are C pointers, ergo raw pointers should be nullable. However this is really frustrating to work with for a Rust programmer. These are the only nullable pointers in Rust. Admittedly, they’re also the only pointers that can be dangling, which is a much more pernicious and undetectable problem!

The biggest nuisance is being unable to efficiently talk about an absent raw pointer via Option. You can manually encode it as just “if it’s null it’s absent”, but this means that you have to remember to check, and it’s not at all enforced or represented by your types.
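To illustrate the size cost, here is a sketch using std::ptr::NonNull, which is a type from today's standard library and not something this post assumes exists. Node and the helper names are hypothetical:

```rust
use std::mem::size_of;
use std::ptr::NonNull;

// Hypothetical list node; `Node` and `next` are illustrative names only.
pub struct Node {
    pub value: i32,
    // Absence is part of the type: callers must go through the Option.
    pub next: Option<NonNull<Node>>,
}

// A non-null pointer leaves the null bit pattern free to encode None,
// so the Option adds no space.
pub fn option_is_free() -> bool {
    size_of::<Option<NonNull<Node>>>() == size_of::<*mut Node>()
}

// A plain raw pointer may legitimately be null, so Option needs extra
// room for its discriminant.
pub fn raw_option_is_larger() -> bool {
    size_of::<Option<*mut Node>>() > size_of::<*mut Node>()
}

// Converting at the boundary turns "remember to check for null" into a
// check the type system forces on you.
pub fn wrap(raw: *mut Node) -> Option<NonNull<Node>> {
    NonNull::new(raw)
}
```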

This is a bit of a moot point for std abstractions, which actually use 0x01 (heap::EMPTY) as the sentinel for “absent” so that they themselves can be null-pointer optimized (Box, Rc, Arc, Vec, HashMap, etc…). Although it is perhaps worth noting that this sentinel value is never (to my knowledge) checked. These types always check whether they have a valid pointer using other state. Whether it be size_of::<T>() == 0, len == cap, or cap == 0. heap::EMPTY is just an agreed upon garbage value to put there that isn’t null.

Raw pointers aren’t Send or Sync

Another case of a curious default. This one can actually be overridden – you can impl both Send and !Send as appropriate. Since raw pointers are just integers, they are in principle trivially Send and Sync. Not being Send or Sync is actually basically a lint. If you’re doing stuff with raw pointers, it’s non-trivial and you may not have thought about thread-safety. So in order to force you to consider thread safety, any type that contains raw pointers must manually impl Send or Sync.
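A small sketch of what that opt-in looks like in practice. Handle is a hypothetical type that uniquely owns a heap-allocated u32, which is the assumption that makes the Send impl defensible here:

```rust
use std::thread;

// Hypothetical wrapper; `Handle` is an illustrative name only.
pub struct Handle {
    ptr: *mut u32,
}

impl Handle {
    pub fn new(v: u32) -> Self {
        Handle { ptr: Box::into_raw(Box::new(v)) }
    }

    pub fn get(&self) -> u32 {
        unsafe { *self.ptr }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        unsafe { drop(Box::from_raw(self.ptr)); }
    }
}

// Without this line Handle is not Send, because it contains a raw pointer.
// Writing it is the author's promise that they have thought about it:
// Handle is the only owner of the allocation, so moving it is fine.
unsafe impl Send for Handle {}

// Moves the Handle into another thread; this only compiles because of
// the impl above.
pub fn read_on_other_thread(h: Handle) -> u32 {
    thread::spawn(move || h.get()).join().unwrap()
}
```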

Raw pointers often require importing ptr

This is another paper-cut. Having to use ptr::read(ptr) or ptr::write(ptr, val) is just kinda annoying. Free functions are exposed because this allows passing & or &mut to them and having them coerced as appropriate.
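As a sketch of the coercion being referred to, the hypothetical helper below passes references straight to the free functions (it is just std::mem::replace written by hand):

```rust
use std::ptr;

// Hypothetical helper, for illustration only.
pub fn replace_in_place(slot: &mut String, new: String) -> String {
    unsafe {
        // &String coerces to *const String at the call site.
        let old = ptr::read(&*slot);
        // &mut String coerces to *mut String. write does not drop the
        // previous contents, which is correct because `old` now owns them.
        ptr::write(&mut *slot, new);
        old
    }
}
```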

Historically this also hacked around the fact that doing anything other than directly re-exporting an intrinsic severely penalized the compiler. Literally a trivial wrapper function around the intrinsics would make it fall over.

Unique

ptr::Unique is the solution to many of these woes. It’s defined as follows:

struct Unique<T: ?Sized> {
    _ptr: NonZero<*const T>,
    _boo: PhantomData<T>,
}

And exposes the following functionality:

impl<T> Deref for Unique<T> {
    type Target = *mut T;
    // ...
}

impl<T: ?Sized + Send> Send for Unique<T> {}
impl<T: ?Sized + Sync> Sync for Unique<T> {}

Semantically, it’s specified to behave as if you literally contain a value of type T in your struct – which as “merely” an implementation detail is indirected to the heap. It’s not clear to me if the data must necessarily be on the heap, though it seems like that’s the only way to properly resolve its semantics (in particular, Send and Sync).

Consequently, you get the following behaviour:

  • variant over T (yay!)
  • claims to own T (yay!)
  • derefs to *mut T, so unique.offset(idx) produces a *mut T (yay!)
  • non-null in a way the language understands (so null ptr optimizable) (yay!)
  • derives Send and Sync as if you contained a T (yay!)
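A rough user-level approximation of those properties, assuming today's std::ptr::NonNull stands in for the internal NonZero wrapper. UniqueSketch and the helper functions are hypothetical names, and this is not the actual std definition:

```rust
use std::marker::PhantomData;
use std::mem::size_of;
use std::ptr::NonNull;

// Hypothetical stand-in for the type described above.
pub struct UniqueSketch<T> {
    ptr: NonNull<T>,       // non-null, and covariant in T
    _owns: PhantomData<T>, // claims to own a T (a real owner would also
                           // free the allocation in Drop)
}

// NonNull is neither Send nor Sync, so thread-safety is forwarded from T
// by hand, as if the struct contained a T directly.
unsafe impl<T: Send> Send for UniqueSketch<T> {}
unsafe impl<T: Sync> Sync for UniqueSketch<T> {}

impl<T> UniqueSketch<T> {
    // Returns None for a null pointer.
    pub fn new(ptr: *mut T) -> Option<Self> {
        NonNull::new(ptr).map(|ptr| UniqueSketch { ptr, _owns: PhantomData })
    }

    pub fn as_ptr(&self) -> *mut T {
        self.ptr.as_ptr()
    }
}

// Null-pointer optimization: the Option costs no extra space.
pub fn niche_works() -> bool {
    size_of::<Option<UniqueSketch<u8>>>() == size_of::<*mut u8>()
}

// Only callable for types that are both Send and Sync.
pub fn is_send_sync<T: Send + Sync>() -> bool {
    true
}

// Compiles only because the wrapper is covariant in T.
pub fn covariant<'a>(u: UniqueSketch<&'static str>) -> UniqueSketch<&'a str> {
    u
}
```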

However it exacerbates the ptr::read problem. Now you need to ptr::read(*self.ptr), which is incredibly confusing.

Also, while it’s perfect for Vec and Box, it’s semantically inappropriate for Rc and Arc (which do not uniquely own their data, but rather share it). In practice the only thing it does wrong is derive Send and Sync (which e.g. Rc could opt out of). However it is in principle desirable to apply alias analysis to Unique: we should be able to use the fact that Box, Vec, etc contain a pointer to some uniquely owned data on the heap. In LLVM parlance, pointers derived from a Unique can only be aliased by other pointers derived from it. We do not currently provide this information to LLVM (to my knowledge).

Finally, Unique is a bit of a confusing name, as it suggests a stronger claim than is really intended. It is not that this is a unique pointer to that data (it’s fine to take references into it), but rather that it is the owner of that data. I personally was very confused when I first encountered Unique because of this. I wasn’t sure if some things could be marked as Unique because they didn’t seem to be unique in practice. But that wasn’t the issue at hand. What mattered was that the pointers were the only owner of the data.

Detailed Design

To better resolve all these issues, this RFC proposes the following changes:

Rename Unique to Owned

This, in my mind, provides a better intuition as to the meaning of this type.

Add ptr::Shared for Rc and Arc

At worst this is just a nice de-duplication of work between Rc and Arc. Note that this is blocked on the fact that PhantomData prevents DST coercions of Shared<T> or Unique<T> (which would break Rc and Arc). Box dodges this by being legitimately magic, and having its definition totally ignored by the compiler for DST stuff. I have a PR and RFC open for this.

Add ptr functions as methods on raw pointers

In particular:

  • read, copy, copy_nonoverlapping on *const and *mut
  • write, write_bytes, replace, swap on *mut

This enables the following expressions:

ptr.offset(idx).read();
ptr.offset(idx).write(elem);
unique.read();

This requires fewer imports, is more pleasant to read, and also avoids the nastiness of manually derefing a Unique (or Shared) to be passed to read or write (it genuinely looks like a normal ptr).
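For a feel of how the method forms read in context, here is a sketch with a hypothetical helper. It assumes read, write and offset are callable as methods on *mut T, which is the case in today's std:

```rust
// Hypothetical helper, for illustration only: swaps in `elem` at `idx`
// and returns the value that was there.
pub fn read_and_overwrite(buf: &mut [i32], idx: isize, elem: i32) -> i32 {
    // Bounds check up front so the pointer arithmetic below stays in range.
    assert!(idx >= 0 && (idx as usize) < buf.len());
    let ptr: *mut i32 = buf.as_mut_ptr();
    unsafe {
        // No `use std::ptr;` needed; the calls chain left to right.
        let old = ptr.offset(idx).read();
        ptr.offset(idx).write(elem);
        old
    }
}
```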

Drawbacks

Duplicated functionality in functions and methods for the ptr stuff.

Shared seems to have limited scope – worth adding just for Rc and Arc?

Unresolved Questions

Is the intrinsic wrapping problem still a thing? Will moving over to methods on raw pointers cause serious regressions in compile time or codegen quality?

Can anything else use Shared?

Do FFI authors want some types with better semantics (non-nullable, owned, etc), or is *mut and *const “good enough”? acrichto and sfackler seemed to think just mirroring headers was the right way to go.


#2

Probably a naive and not-new question but: can either Owned or Shared dangle? If not, how much difference is there between Owned and &'static mut, resp. Shared and &'static? (Obviously you couldn’t use the former in place of the latter – but what about vice versa?) I can sort of guess the answers but would still be interested to read them.


#3

@Gankro Am I correct in saying that the only difference between Shared and Unique is that Shared is not Send or Sync? I assume that the variance and PhantomData/dropck stuff is the same.


#4

GCC has the function attribute nonnull for some or all parameters, and Clang supports this too. So this might still fall under “just mirroring headers”.


#5

Yep! (but also I think they could be understood by the compiler to have novel aliasing semantics as well)

They can definitely dangle or point to uninit memory. Many collections store a dangling pointer in their empty state (len == 0 or size_of::<T>() == 0). In addition, the destructor frees this pointer, so it’s dangling for a little bit even without this optimization. Arguably this reform also makes raw pointers more ergonomic than references: you can .offset(idx).read() them.


#6

&'static T implies that the data behind it lives forever, which is totally not the case with Shared<T> (and this allows evil optimizations - e.g. turning a free-after-use to a use-after-free).

&'static mut T is almost equivalent to Unique<T>, but it also has an implicit T: 'static bound, which is not desirable.


#7

@Gankro Wasn’t sure where else to put this, but you may be interested in: https://github.com/apasel422/ref_count.


#8

I don’t know if I like memcpy being a method on raw pointers. src.copy(dst, 1), and dst.copy(src, 1) both make very little sense as methods; there’s not really a good self choice. Perhaps something like src.copy_to(dst, 1), but that would be weird because you’re renaming the function.

Otherwise, I absolutely love this pre-rfc.
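For comparison, a sketch of the directional spelling floated above next to its mirror image. Both copy_to and copy_from exist as raw-pointer methods in today's std, which is not something this thread could assume; the wrapper function is hypothetical:

```rust
// Hypothetical demo function, for illustration only.
pub fn copy_both_ways() -> ([u8; 3], [u8; 3]) {
    let src = [1u8, 2, 3];
    let mut a = [0u8; 3];
    let mut b = [0u8; 3];
    unsafe {
        // The receiver is the source: reads as "copy src to a".
        src.as_ptr().copy_to(a.as_mut_ptr(), 3);
        // The receiver is the destination: reads as "b, copy from src".
        b.as_mut_ptr().copy_from(src.as_ptr(), 3);
    }
    (a, b)
}
```

Naming the direction in the method makes the choice of self unambiguous, at the cost of diverging from the free function's name.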