On native allocations pointing into a GC heap


#1

Motivation for this post: @zwarich asked a good question here, about potential API’s for having allocators that need not be tightly integrated with the garbage collector:

https://github.com/rust-lang/rfcs/pull/244#issuecomment-56929140

I attempted to provide a complete answer there. But there are some extra details that I think are relevant to the topic, yet I did not want to make my response comment too large.

Therefore, I thought I would put my more complete thoughts here, and hopefully they might be revised over time as our vision for GC becomes more complete.


Conceptually, I think of the memory in a garbage-collected runtime as being partitioned into three or four kinds (I’ll explain the “or four”).

  1. The first is GC-managed blocks of memory; they are allocated by the collector and it is the responsibility of the collector to trace them (and reclaim them when possible).
  2. The second are non-GC-managed blocks of memory that contain no pointers into the GC-heap; these are manually managed (explicit free’s or other reclaiming scheme) and need no integration with the collector.
  3. The third is the stack, which, with the program registers, makes up the obvious root set used in standard collection schemes.
  4. The fourth is non-GC-managed memory that can contain pointers into the GC-heap; these are also manually managed, but still require GC integration (especially if you want to trace such things precisely)

(If you squint your eyes. the third case could be considered just a special case of the fourth. But for a moment, let us treat them as distinct cases.)

The question is: What do you do about that fourth kind? Even assuming that you have stack maps that tell you which stack slots contain GC ptrs (and perhaps even what their precise types are, so that you can derive type-specific tracing methods), those stack maps will not include the fourth kind of memory.

  • So, do you try to find paths from the roots (the so-called “third kind” above) to blocks in the fourth kind? Note that the reason to do this is not because one thinks that the blocks may have become unreachable, but rather because one can derive from such paths what the precise type to use when tracing the block of memory.

  • Or do you register each such block itself as a root (and then either scan it conservatively, or include with the registration a type, or a description of its layout, or a tracing hook)?

I think the strategy of trying to find a path from a root in the application code to a native block is perilous; to me, part of the point of being able to allocate native memory is precisely so I can hand it off to some native library (perhaps temporarily), and not have to deal with in my own application code.

(Of course, one could argue that the list of registered blocks is itself acting as a very direct path from a root (the registry itself) to the block. But it still seems a lot simpler than trying to force such a path to be otherwise derivable.)


#2

Would this mean that storing a native allocation in a Gc'd block can cause a “reference cycle”? That is, given a set up where Native stores a Gc that points to itself, a conservative scan of the Native root would cause the Gc to stick around forever because it would be considered reachable via Native even if Native itself isn’t reachable.

+- <some memory>
|        ^
|        |
+-> Gc<Native>

#3

My intention with the model I presented above was that categories 3 and 4 classified memory belonging to objects that were owned by non-GC data. So under that assumption (which may be too strong; see below), then the situation you describe should never arise, because Native should always have an owner responsible for free'ing it that lives somewhere outside the GC.


There are however a number of gotcha’s that I admit I had not thought about much before now.

The biggest gotcha I think is this: In my presentation above, I make it sound like every memory allocation is assigned to one of the four categories and can never change its category during its lifetime, for lack of a better word. But if you cannot change an allocation’s category, then the above scheme cannot model programs where an owning pointer to memory is transferred from a non-GC-allocated owner to a GC-allocated owner (where the GC-allocated owner would presumably have a finalizer that takes care of calling free). That sounds too restrictive to me (though its possible that one might still be able to write interesting/useful programs with such a restriction in place).


I spent some time this morning thinking about how to handle transfer of ownership, but found nothing satisfying.

(Maybe this is a sign that my attempt to avoid manual root registration is mistaken.)