Request for Critique: Align TypeIds to CBOR tags - CONCLUSION: Bad idea

I had what is either a good or bad idea this morning, and I honestly can't decide which this is. I'd appreciate feedback on the idea.

TypeId is currently an opaque type that is (again, currently) just a wrapper around a u64. The value of the TypeId for a given type can change between compiler versions. This means that if you want two chunks of code to communicate a TypeId between each other, you first need to make certain that they were both compiled with the same version of the compiler. While this is possible, it means that TypeIds are less useful for things like serialization [1] or interoperability between languages. There are probably other places where stable values for TypeIds would be useful.
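For anyone who hasn't used it, here is a minimal sketch of what the opaque type offers today: equality comparison inside one compiled program, and nothing more. It uses only std::any::TypeId::of; the same_type helper is a name made up for this illustration.

```rust
use std::any::TypeId;

/// Hypothetical helper: true when `A` and `B` are the same type.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

fn main() {
    // Comparisons are meaningful inside one compiled program.
    println!("u32 vs u32: {}", same_type::<u32, u32>());
    println!("u32 vs i32: {}", same_type::<u32, i32>());

    // The value itself is opaque. Its Debug output is for humans only and
    // should not be stored or sent to code built by a different compiler.
    println!("{:?}", TypeId::of::<u32>());
}
```

Nothing here exposes the underlying integer, which is why the representation can change without breaking well-behaved code.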

This is where my idea comes in. The IANA maintains a type tag registry for CBOR, with tag values all fitting within what a u64 can hold. I'd like it if the TypeId values for types that map directly to anything that is currently in the registry be set to those tag values.

For everything that doesn't match something that is currently in the registry, we can do the following:

  • Register the TypeId type as a new tag value with the IANA. This will hold up to a 64-bit value, and allows the compiler to create values independent of what is registered with the IANA.
  • Create a method of setting the TypeId value for a type (possibly a new attribute that can be added to types).
  • Provide a method of determining whether a given type's TypeId was set (implying it was registered with the IANA), or arbitrarily chosen by the compiler.
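To make the second and third points a bit more concrete, here is a rough sketch using an ordinary trait in place of the proposed attribute. Everything in it (StableTag, TAG, is_registered, and the two example types) is hypothetical and exists only for this illustration; it is not an existing or planned compiler feature.

```rust
/// Hypothetical trait standing in for the proposed attribute.
trait StableTag {
    /// `Some(tag)` if a registered tag was set by hand,
    /// `None` if the compiler is free to choose a value.
    const TAG: Option<u64> = None;
}

/// Example type that opts in to CBOR tag 1 (epoch-based date/time).
#[allow(dead_code)]
struct EpochSeconds(i64);

/// Example type that leaves the choice to the compiler.
struct LocalThing;

impl StableTag for EpochSeconds {
    const TAG: Option<u64> = Some(1);
}

impl StableTag for LocalThing {}

/// The "was this TypeId set or arbitrarily chosen?" query from the third point.
fn is_registered<T: StableTag>() -> bool {
    T::TAG.is_some()
}

fn main() {
    println!("EpochSeconds registered: {}", is_registered::<EpochSeconds>());
    println!("LocalThing registered: {}", is_registered::<LocalThing>());
}
```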

The first point is necessary because if Rust adopts the idea of keeping in sync with the IANA, then every time anyone creates any new type they would need to register it with the IANA or risk the TypeId they choose getting assigned to a different type by the IANA. By creating and registering a new TypeId tag (maybe carrying the version of the compiler that produced the value? I'm open to suggestions on what should go in here!), we can create values that are entirely independent of the IANA, and which never conflict with any registered values.

If a type is sufficiently important that someone is willing to register it with the IANA to get a permanent value for their TypeId, then there will need to be a method of setting the TypeId for that type. Registering with the IANA implies that the registrant expects their type to be broadly useful and of sufficient importance that it's worth the effort of registration.

Obvious questions/concerns

What prevents people from deliberately munging with their TypeId values to cause conflicts?

Nothing. I don't know whether this would be a security issue; that's something people more knowledgeable than me would need to decide.

OK, I get the idea, but is this useful?

Seriously? I don't know. The Rust compiler has no direct concept of serialization, so I don't know if following this proposal would needlessly complicate the compiler without providing a tangible benefit. Like I said at the start, I can't decide if this would be a good idea or a bad one, and I need help deciding that.

Thoughts?

[1] I'm well aware that there are a vast number of serialization crates out there that have already dealt with this problem. Part of the reason I'm throwing this idea out to everyone to critique is to see if this idea would improve the landscape for everyone.

Final conclusion

For those that stumble on this and don't want to read the whole topic:

As noted in this post, if TypeIds are given fixed values, then it becomes a soundness requirement that exactly one crate in the universe be permitted to use that value. Otherwise, if multiple crates use the same value for their own version of a given type, and someone compiles two or more of these crates into the same binary, then there is an implicit transmute() buried within the code. This may or may not affect the behavior of the code (depends on the code involved), but implicit unsafe code is not what Rust is all about, so the idea is a bad idea.
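A small sketch of the collision, assuming a hypothetical FixedId trait in place of hand-assigned TypeIds. The two modules play the role of two independent crates; none of these names come from a real library. The code only detects the clash; it deliberately does not perform the unsound cast.

```rust
/// Hypothetical stand-in for a TypeId whose value is chosen by the crate author.
trait FixedId {
    const ID: u64;
}

/// Plays the role of one crate that claims tag 37 for its UUID type.
mod crate_a {
    #[allow(dead_code)]
    pub struct Uuid(pub [u8; 16]);
    impl super::FixedId for Uuid {
        const ID: u64 = 37;
    }
}

/// Plays the role of an unrelated crate that claims the same tag.
mod crate_b {
    #[allow(dead_code)]
    pub struct Uuid(pub Vec<u8>);
    impl super::FixedId for Uuid {
        const ID: u64 = 37;
    }
}

/// The check an ID-based downcast would rely on.
fn ids_match<A: FixedId, B: FixedId>() -> bool {
    A::ID == B::ID
}

fn main() {
    // Both types claim 37, so a downcast keyed on this ID would accept a
    // crate_b::Uuid where a crate_a::Uuid is expected and reinterpret a
    // Vec<u8> as a [u8; 16]. That is the implicit transmute.
    println!("ids match: {}", ids_match::<crate_a::Uuid, crate_b::Uuid>());
}
```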

Not only are TypeIds unstable between compiler versions; types are, too. A struct Foo in one version can have a different layout from the same struct Foo in another version, and both are vastly different from struct Foo in a different, but semver-compatible, version of the same library (which might add fields, etc.).

AFAIK, TypeId’s main point is to support downcasting from dyn Any. For this, there’s no point in communicating TypeIds between different software, or between instances of the same code compiled with different compiler versions where layout changes mean that even this main point of TypeIds no longer applies.
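For reference, this is roughly what that downcasting use looks like. The describe function is just an illustrative name; downcast_ref is the standard library method on dyn Any, and it compares TypeIds internally.

```rust
use std::any::Any;

/// Illustrative helper: inspect a type-erased value by trying known types.
fn describe(value: &dyn Any) -> String {
    if let Some(n) = value.downcast_ref::<u32>() {
        format!("u32: {}", n)
    } else if let Some(s) = value.downcast_ref::<String>() {
        format!("String: {}", s)
    } else {
        "unknown type".to_string()
    }
}

fn main() {
    println!("{}", describe(&7u32));
    println!("{}", describe(&String::from("hi")));
    println!("{}", describe(&1.5f64));
}
```

Both sides of the comparison live in the same binary, so the TypeId never has to mean anything outside of it.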

I don't think TypeId is the correct mechanism for this. Most importantly, why CBOR's? One could equally argue for protobuf's numbers, say.

Is there any code that relies on TypeIds being a u64?

Could we make type ids somehow correspond to the structure of the type they were taken from?

Some tree-like data structure describing a type, hashed into a 256-bit string?
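A rough sketch of the idea, under loud assumptions: the Shape tree is a made-up description format, and a 64-bit FNV-1a hash stands in for the 256-bit hash, since the standard library ships no cryptographic hash. A real design would need a collision-resistant hash and an unambiguous encoding (this one has no length prefixes).

```rust
/// Hypothetical description of a type's structure.
enum Shape {
    Prim(&'static str),
    Array(Box<Shape>, usize),
    Struct(Vec<(&'static str, Shape)>),
}

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Fold `bytes` into a running 64-bit FNV-1a hash.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Walk the tree, feeding a marker byte plus the node's contents into the hash.
fn hash_shape(hash: u64, shape: &Shape) -> u64 {
    match shape {
        Shape::Prim(name) => fnv1a(fnv1a(hash, b"P"), name.as_bytes()),
        Shape::Array(elem, len) => {
            let h = hash_shape(fnv1a(hash, b"A"), elem);
            fnv1a(h, &(*len as u64).to_be_bytes())
        }
        Shape::Struct(fields) => {
            let mut h = fnv1a(hash, b"S");
            for (name, field) in fields {
                h = fnv1a(h, name.as_bytes());
                h = hash_shape(h, field);
            }
            h
        }
    }
}

fn structural_id(shape: &Shape) -> u64 {
    hash_shape(FNV_OFFSET, shape)
}

fn main() {
    let uuid = Shape::Struct(vec![(
        "bytes",
        Shape::Array(Box::new(Shape::Prim("u8")), 16),
    )]);
    println!("{:016x}", structural_id(&uuid));
}
```

Note the limitation: two types with identical structure but different meaning get the same id under this scheme.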

Funnily enough, I’ve only recently thought about whether something hash-based (representing a tree) could work as TypeId, while imagining whether TypeIds could potentially support being generated even at run-time somehow in a (hypothetical) setting where Rust supports polymorphism.

I don’t see how any of this addresses the point that these things are still going to be depending on the version of the library, or how this could have anything to do with any external “type tag registry”.

Can we make an API with a blanket const impl for all types and then use specialization to bring it to run time?

I don’t understand this question :sweat_smile: – where does specialization come into play?

Rust's internal layout is going to be different from CBOR's layout no matter what happens; I'm not suggesting ABI compatibility here, just semantic stability.

Yeah, that's part of what I was trying to 'solve'. I'm trying to figure out how to make TypeId more generally useful than just for downcasting. It feels like this should be possible, but I feel like I'm not quite smart enough to explain what those uses can be.

Why not?

Mainly because it's backed by the IANA, which has a pretty good track record of dealing with these kinds of registries.

Sounds a little like a Merkle tree, although I think that would be overkill for this application.

Like, we do a blanket const impl<T> TypeIdGenerator for T implemented via intrinsics, and specialize it into a concrete one impl TypeIdGenerator for _

How would you resolve multiple libraries that export their own types corresponding to one CBOR type? As a very simple example uuid v0.7 and uuid v0.8 would both be exporting a type that would be tag(37).

Ah, so you were referring to “specializing” a const trait implementation with a non-const one… at least I now understand your wording. I still feel like that’s not quite the right phrasing for something like this, but don’t feel like discussing this in-depth in this thread, as it seems to be going off-topic w.r.t. what OP proposed.

That’s basically what I meant, yes. The only application I had in mind was run-time generation; since that wasn’t your application, it’s unsurprising if it feels “overkill” :slight_smile:

There shouldn't be. If there is, they're relying on what is currently an implementation detail (the docs are very clear that it's an opaque type).

What would be hashed in this case? Genuinely curious: everything that I can think of hashing could be attacked to create an identical hash but different semantics.

As long as the layout and semantics are the same, then it wouldn't be a problem. The only time it becomes a problem is when the semantics differ from the defined semantics, or when the layout changes.

struct Uuid(Vec<u8>); and struct Uuid([u8; 16]); are both valid representations of tag 37, but they definitely can't have the same type id.
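A quick check of that claim with today's compiler-assigned TypeIds. The heap and inline module names are made up so that both definitions can coexist in one file; distinct is an illustrative helper.

```rust
use std::any::TypeId;
use std::mem::size_of;

/// One plausible representation of a UUID: bytes on the heap.
mod heap {
    #[allow(dead_code)]
    pub struct Uuid(pub Vec<u8>);
}

/// Another plausible representation: sixteen bytes stored inline.
mod inline {
    #[allow(dead_code)]
    pub struct Uuid(pub [u8; 16]);
}

/// True when the compiler assigned different TypeIds to `A` and `B`.
fn distinct<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() != TypeId::of::<B>()
}

fn main() {
    // Same name, same meaning on the wire, but two different Rust types.
    println!("distinct: {}", distinct::<heap::Uuid, inline::Uuid>());
    println!(
        "sizes: {} vs {} bytes",
        size_of::<heap::Uuid>(),
        size_of::<inline::Uuid>()
    );
}
```

If both were forced to share one id, a downcast between them would read one representation as the other.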

I'm sorry, I didn't mean to make it sound derogatory. My brain was thinking about how to make this fast, but since this code is unlikely to be called often (and it can be cached after it's called), the costs are actually fairly negligible. The only real issue is how to capture semantic changes that don't affect the name or layout of the types involved.

Absolutely true, so one of those would be chosen as the 'canonical' type, while the other would have to have a different TypeId (or a different tag).

How do we choose this type? How do we choose tags for the new types? Moreover, we use TypeIds to support dyn Any, so, likely, we also need to include some location info about the type...

I think the full path to a type plus the version of the crate (semver concerns...) also needs to be included in the TypeId.
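Something along these lines, as a sketch only. TypeKey and its fields are hypothetical names, and the version strings are just the two uuid releases mentioned earlier in the thread.

```rust
/// Hypothetical identity for a type: where it lives and which release defined it.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TypeKey {
    crate_name: &'static str,
    crate_version: &'static str,
    path: &'static str,
}

impl TypeKey {
    /// One possible textual form that could be fed into a hash.
    fn canonical(&self) -> String {
        format!("{}@{}::{}", self.crate_name, self.crate_version, self.path)
    }
}

fn main() {
    let old = TypeKey { crate_name: "uuid", crate_version: "0.7", path: "Uuid" };
    let new = TypeKey { crate_name: "uuid", crate_version: "0.8", path: "Uuid" };

    // Same path, different release: the two keys must not compare equal.
    println!("{} == {} ? {}", old.canonical(), new.canonical(), old == new);
}
```

The semver concern shows up immediately: two patch releases with identical definitions would also get different keys unless the version is truncated somehow.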

My thought for this was to define a new enum similar to the following:

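use std::any::TypeId;

#[non_exhaustive]
#[allow(non_camel_case_types)] // keeps the Tag_N naming used in this sketch
pub enum TypeIdWrapper { // I'm not good with names, someone bikeshed this
    Arbitrary(TypeId), // whatever the compiler assigns today
    Tag_0, // The trailing number is the value of the tag in the IANA registry
    Tag_1,
    Tag_2,
    // So very, very many tags
}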

The Arbitrary variant is unrestricted, and does not map to anything currently registered in CBOR or anywhere else. It's whatever the compiler is doing currently, and it can change from compiler version to compiler version. If we registered the TypeId tag type with the IANA, then this variant would actually be removed and replaced by the new Tag_XXXX variant that the IANA sets.

All other variants map to whatever is currently registered with the IANA. At the moment there are around 150 tags that are assigned, so while this enum would be largish, it wouldn't be unimaginably large. As new tags are defined, they can be added to the enum without breaking semver (if I understand how semver and #[non_exhaustive] work, please correct me if I'm wrong!).
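Here is a sketch of how such an enum might be consumed, with descriptive variant names swapped in for Tag_N. TypeTag, cbor_tag and describe are hypothetical names; the tag numbers (0, 1, 2 and 37) are the registered CBOR tags for date/time string, epoch-based date/time, unsigned bignum and binary UUID.

```rust
use std::any::TypeId;

#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeTag {
    /// Compiler-chosen value, not registered anywhere.
    Arbitrary(TypeId),
    DateTimeString, // CBOR tag 0
    EpochDateTime,  // CBOR tag 1
    UnsignedBignum, // CBOR tag 2
    BinaryUuid,     // CBOR tag 37
}

impl TypeTag {
    /// The registered tag number, if this variant has one.
    pub fn cbor_tag(self) -> Option<u64> {
        match self {
            TypeTag::Arbitrary(_) => None,
            TypeTag::DateTimeString => Some(0),
            TypeTag::EpochDateTime => Some(1),
            TypeTag::UnsignedBignum => Some(2),
            TypeTag::BinaryUuid => Some(37),
        }
    }
}

/// How a downstream crate would have to match: #[non_exhaustive] forces a
/// wildcard arm there, so newly added variants do not break its build.
fn describe(tag: TypeTag) -> &'static str {
    match tag {
        TypeTag::Arbitrary(_) => "compiler-chosen",
        TypeTag::BinaryUuid => "uuid",
        _ => "some other registered tag",
    }
}

fn main() {
    let tag = TypeTag::BinaryUuid;
    println!("{:?} -> {:?} ({})", tag, tag.cbor_tag(), describe(tag));
    println!("{}", describe(TypeTag::Arbitrary(TypeId::of::<u8>())));
}
```

The attribute only constrains other crates; inside the defining crate the match in cbor_tag can stay exhaustive.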

If we're talking about using hashing, then yes, that would be a good idea. However, if we're using manually assigned attributes, then that isn't required. Either the identifier is assigned arbitrarily by the compiler, or the end user would assign it via an attribute.

I'm sure that there are issues with the mechanics of this idea, but that's kind of where I'm heading right now.

Setting aside that IANA would probably not be amused by millions of registration requests :slight_smile: I generally prefer a much less centralised approach. The issue you want to solve — namely identifying the structure and meaning of some data within a distributed system — is rather similar to what unisonweb is doing; they take it one level further and apply the principle to code. The trick to avoiding a registry is to use hashing and content addressing. The hard problem this poses is to find a canonical representation so that inconsequential variations of local representation do not change the hash (like how I want that structure laid out in memory — network peers care only about the wire format; sending the memory layout over the wire is a doomed concept, given that there exists more than one CPU architecture).
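A tiny illustration of the memory-layout point, using only standard integer methods. encode_wire and decode_wire are illustrative names; big-endian is chosen here because it is the network byte order that CBOR also uses for its integers.

```rust
/// Encode a value in a fixed wire format (big-endian), whatever the CPU is.
fn encode_wire(value: u32) -> [u8; 4] {
    value.to_be_bytes()
}

/// Decode the fixed wire format back into a native integer.
fn decode_wire(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}

fn main() {
    let value = 0x0102_0304u32;

    // The in-memory bytes depend on the architecture this runs on.
    println!("native: {:?}", value.to_ne_bytes());

    // The wire bytes are the same everywhere, so a hash over them is too.
    println!("wire:   {:?}", encode_wire(value));
    println!("round trip ok: {}", decode_wire(encode_wire(value)) == value);
}
```

Hashing the canonical wire form instead of the in-memory form is what keeps a content address stable across machines.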

Rust’s TypeId has a very narrow and well-defined purpose, so I’d ask you to leave it as is and create a new concept, because what you want is outside TypeId’s scope.
