**Recursively** pinning memory

I have an application where I need to pin all objects in an object graph, at least temporarily, which means that Pin don't seem like it applies. My hope in this post is to raise awareness, and see if there are any solutions possible. I suspect that this would require some kind of compiler intrinsic or language keyword, like a new type of scope within which all objects are pinned.

My problem is that I want to serialize/deserialize object graphs, and have them reconstituted to be equivalent to the original graphs. For example, if I have an object graph that has cycles in the graph, I want the deserialized graph to have the same cycles. This is not what serde is currently capable of (see the docs for Arc). That said, if you can force all objects to remain pinned in memory, and you can get the address of each object, then you could create a new object that implements the Serialize, Serializer, Deserialize, and Deserializer traits that acts as an intermediate layer between what we're serializing, and the 'real' backend serializer. It would maintain a single object, the data_store: Hashmap<usize, Option<Vec<u8>>>. The keys are the memory addresses of the object being serialized, and the value is the value of the object after it is has been serialized. At this point, you can perform a simple depth-first search to store everything in data_store.

  • For non-pointer types, you store the address of the object as the key, and the value is the serialized value of the object.
  • For pointer types, you do the following:
    • Get the address of the pointer itself. If it is already in data_store, then continue, otherwise:
      • Create a new entry in data_store with the address of the pointer as the key, and Option::None as the value.
      • Recurse through the pointer, restarting this algorithm.
      • When you pop back up, replace this pointer's value with Option::Some(address of the object being pointed to)

Once you're done with this, you have an object that has no cycles as it doesn't have any actual pointers. data_store can then be fed through any serde serializer. Deserialization is just doing the above backwards. Note that the memory addresses of the reconstituted objects can be different from the original memory addresses; the only reason I needed the original memory addresses was to act as unique names for the objects that were being serialized.

This also shows how weak this is; if an object is moved while the object graph is being serialized, then its name has effectively changed, breaking all my assumptions. So, is there any way to guarantee that all objects that are recursively reachable starting at some object or pointer are pinned? Given that Unpin exists, I can't rely on Pin.

The only thing that I can think of is to change the language so that creates a new scope which marks all objects in that stack frame and below as being immobilized. So, something like the following would be legal rust:

use serde::{Deserialize, Serialize};
use serde_json::Result;

#[derive(Serialize, Deserialize)]
struct Cyclic {
    cycles: Option<Rc<Cyclic>>
}

fn serialize(c: &Cyclic) -> -> Result<String> {
    immobilize {
        serde_json::to_string_pretty(c);
    }
}

immobilize would also have to be nestable, with the internally nested immobilize scopes being no-ops. Only when the highest immobilize scope is exited will normal semantics (everything can move) be returned to.

I think you've got a pretty big XY problem. First, the notion of pinning that you're talking about doesn't have much to do with Rust's Pin type, so discount that.

However, guaranteeing that values won't move around (or be mutated in other ways) concurrent with a section of code is exactly what Rust's borrow checker provides you. Unless you're dealing with interior mutability, none of your objects will be mutated during your serialization pass. If you are dealing with interior mutability, all you need to do is recursively lock everything first. It's already the case that there's no way your Cyclic type or any of the things it points to could move during that call.

I don't really understand your problem (and don't have time to dig deeper) but I sort of suspect there's an easier way to "serialize an object graph including cycles" than recording the addresses of all the objects in the graph.

3 Likes

OK, looking at the docs for Pin:

Types that pin data to its location in memory.

It is sometimes useful to have objects that are guaranteed not to move, in the sense that their placement in memory does not change, and can thus be relied upon. A prime example of such a scenario would be building self-referential structs, as moving an object with pointers to itself will invalidate them, which could cause undefined behavior.

That is exactly the behavior I want for all objects in a given object graph.

My understanding is that objects that are marked Copy could be copied to new locations by the compiler. This would make pointers potentially invalid, correct? Or am I completely missing the point, and worrying about something that I don't need to worry about?

That's because I made a terrible mess of my explanation... sorry! :sweat:

There might be, but part of what I'm trying to do is make something that is compatible with serde as it is. I'm not sure if I'll ever complete it, but if I do, I'd like to contribute it back to the serde project, so that anyone that wants to can deal with arbitrary object graphs, kind of like how python's pickle can handle arbitrary object graphs. For me personally, the fact that serde can't handle arbitrary object graphs is a continuous annoyance (although I love everything else about it!)

I'm pretty sure this part is a red herring, since moving a value of any type might also copy it to a new location in memory (even if most moves get optimized away in practice).

The difference between Copy and non-Copy types is just that when a move/copy occurs, the borrow checker assumes the old/moved-from value has been invalidated and is no longer legal to use, unless it's a Copy type.

I've never worked with serde or non-trivial serialization work before, but the first thing that came to mind for me is simply forgetting about the memory addresses things have in your program at runtime, and instead explicitly assign every object a unique id as part of your serialization (which ought to be more consistent and deterministic anyway). Or is there some design constraint of serde that wouldn't allow that?

Yeah, looking at just that quote would give that impression. But if that was really the full story behind pinning, then the existence of Unpin and the fact that Pin "isn't recursive" wouldn't make a lot of sense.

I think what withoutboats was getting at is that Pin is actually about guaranteeing a type will not move if moving would somehow invalidate it. Or to put it another way, Pin<Box<T>> means "the T shall never be moved unless the compiler can prove it's safe to move (via Unpin impls)". It's really about a higher level semantic safety guarantee than that involves, but is not equivalent to, memory addresses stability. That's why it's not relevant to your problem (in addition to the fact that it's likely an XY problem anyway).

2 Likes

The issue is that you need a place to store the unique identifier. If you've got a pointer to a u8, where are you going to store the unique identifier? The memory address uniquely identifies the object, and doesn't require any additional storage, which means that user code and objects aren't being modified (could be important if someone expects memory to be laid out in a particular way).

OK, now I get what you guys are saying, and you're right that it is an XY problem. I knew that Pin won't work recursively, but I didn't know that you could still have pinned objects moved provided it won't invalidate them. All that gets back at what I was talking about; I want an immobilize scope, that guarantees objects won't be moved or copied while you're in the immobilize scope. That will ensure that you can serialize a graph.

1 Like

You don't care whether things move, all you care is that an address is not reused for a new and different object while serializing, leading to a false hit in the deduplication dictionary.

An &T is enough to do that without internal mutation; with internal mutation you need to either lock all mutexes at the start, or more practically keep Weaks to Arcs/Rcs in the dictionary instead of just addresses.

Also this probably should be posted in users.rust-lang.org and not internals.rust-lang.org

4 Likes

You already have a guarantee that your objects won't be moved, because the existence of a plain old &T (or Rc<T> or any other shared pointer type) already enforces that.

You don't need to guarantee that objects won't be copied, because the copy won't be reachable from the object graph you're serializing. The only way for it to become reachable is if you mutate an existing part of the object graph to insert it.

So all you need is a guarantee that your object graph won't be mutated while you're serializing it. As @bill_myers points out, &T also already guarantees this for objects without interior mutability. This leaves two possibilities:

  • Either some object's Serialize implementation mutates it,
  • or else the object is mutated by another thread (e.g. it's in a Mutex).

Both of these seem so rare as not to be worth worrying about in the general case. They affect non-cyclic serialization just as much as cyclic serialization. Because they don't affect soundness (just correctness), you could just make a note in the documentation of your wrapper layer that it will behave strangely if you do this.

1 Like

It seems to me that the point in serialization is to be able to rebuild the same structure again in another place (or save and reload) with no discernable difference between the data. you are never going to be able to grab the same set of addresses on another system

That being the case, the only way to keep that graph consistent is to use unique id's and have each pointer replaced by that (at least at serializatron time)

This feels to be like a case for some ECS tool. It might be a bit overkill for your needs but have you checked out the specs crate

1 Like

More to the point, it makes that guarantee "forever." You already can guarantee that an object won't move, or mutate, for a given lifetime 'a by having a &'a reference to it.

This user would need to have really specific requirements for normal references not to work - they would need to have an object graph using interior mutability with concurrent shared accesses to the graph going on. In which case well, yea, you can't really expect to serialize something that youre simultaneously mutating without some sort of synchronization.

5 Likes

OK, thank you, that is the part that I wasn't getting. My understanding of move semantics included the mental model that the compiler could insert code to move objects at arbitrary moments in time, under circumstances that weren't clear to me. For example, it would be permitted (but admittedly hard to do) for the compiler to insert code that both moved an object and fixed up all pointers to the object at arbitrary points in time[1]. Under this mental model, traversing a graph would fail. However, if rust already guarantees that for the lifetime of any pointer that the pointed to object will not be moved, then that takes care of my concerns.

Yes, but this was really a problem with my mental model of how rust works. Since I thought that move semantics were broader than they apparently are, I thought that an immobilize scope was required, which is why I brought it up here.

I agree with @rpjohnst about putting a note in the documentation about not mutating while serializing or deserializing; there really isn't any other way to do it. While I think I can see a technique that would allow you to grab all of the mutexes without causing a deadlock, I'm not willing to bet my code on it working in all cases.

[1] I'm not saying that this would make any sense to do, only that it would be allowed under my mental model of how move semantics work.

In this case, the memory address is the unique identifier. The reason to use it is because:

  • In a flat address space (I'm assuming that rust will never target a non-flat address space), all memory addresses are unique already, which means you don't need to generate your own unique ids.
  • It's fast. Since nothing extra is generated, you don't have to waste any time creating stuff that is only used in serialization.
  • There is no additional storage required; the object's address is always associated with it. This is important both to reduce your memory footprint, and to ensure that if you need to have an object with a particular memory layout (e.g., you're trying to cross into C/C++), you don't run into any problems.
  • There isn't any problem with Copy objects. That is, you don't have to worry about two objects having the same identifier because you forgot to mutate an identifier after you made a copy of it.

Note that you won't be trying to recreate objects at the same memory addresses as they originally were at; by the time you're deserializing, the original memory addresses have been reduced to unique names only, and the newly created objects could be anywhere in memory.

3 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.