Stable type identifiers

Rust Permanent/Stable Type Identifiers

I have an idea related to #3435 and #3470 that differs from both of them.

The Idea

The idea is that Rust would add permanent identifiers for every type. These identifiers would be derived from the full path of a type, i.e., the identifier for std::vec::Vec would be derived from the path std::vec::Vec and not from the internal structure of std::vec::Vec or its memory footprint, only its path.

The identifier is permanent in the same way a national identification number is: it does not change when you change jobs or move house.

For example, the permanent identifier for Vec would be found by hashing the path of the type.

hash("std::vec::Vec")                        → permanent identifier for `Vec`
hash("my_crate::MyType")                     → permanent identifier for `MyType`
hash("std::vec::Vec") + hash("i32")        → permanent identifier for `Vec<i32>`
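A minimal sketch of the path-hashing idea, assuming a hand-written 64-bit FNV-1a as the hash function. The proposal does not name an algorithm, so FNV-1a and the helper name `path_id` are illustrative choices only, not part of the proposal.

```rust
// 64-bit FNV-1a, written out by hand so the result does not depend on
// the standard library's default hasher, whose algorithm is unspecified.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper: derive an identifier from a type's path alone.
/// Nothing about the type's fields or layout is consulted.
fn path_id(path: &str) -> u64 {
    fnv1a(path.as_bytes())
}
```

Calling `path_id("std::vec::Vec")` twice returns the same value in any build, because the only input is the path string.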

What This Is NOT

This is not a stable ABI: the memory layout of Vec<i32> can remain unstable while the identifier that tells you a value is a Vec<i32> is stable. The two concepts exist at different layers.

Nor is the identifier derived from the actual structure of the type. For example, the identifying hash of Vec does not change if the implementation of Vec changes.

How it Works

  • The compiler will assign the hash based on the full path to the type, and it will do so deterministically. There will be no manual hash value assigned and no central authority to assign hashes.

  • Generics compose: the programmer still writes Vec<i32> as before, and the compiler derives the permanent identifier from it.

hash(Vec<i32>)    = hash("std::vec::Vec") + hash("i32")
hash(Vec<String>) = hash("std::vec::Vec") + hash("std::string::String")

The programmer does not need to do anything. The mechanism is completely transparent.

No collisions between crates:

hash("my_crate::User")  ≠ hash("other_crate::User")

Paths in Rust are always unique, by definition.

When does the hash change? Only when the way it is computed is explicitly changed. Updating rustc does not affect it unless the computation itself was changed. Such a change would be an explicit breaking change, similar to semver.

How it differs from #3435 and #3470

  • #3470 (crABI) resolves issues regarding inter-language memory layout and representation, but does not provide a method of formally defining stable type identifiers.
  • #3435 (export) deals with how to export symbol information dynamically. It requires stable identifiers, but also does not define them.
  • The proposal presented in this document will provide the missing piece, which is an unambiguous, deterministic identifier for all types.

Importance

The use of stable type signatures is critical to having Rust-to-Rust dynamic linkage, as the absence of stable type information means that any dynamic linkage would require unsafe FFI in Rust, and it creates a major source of pain when building plugin systems. Additionally, crABI and export will have no foundation to work from for stable symbol (and possibly ABI) mangling. Therefore, the proposal presented here is the first concrete step toward establishing a stable ABI while not breaking anything in the present.

Outstanding Issues

The exact algorithm for combining generics that have more than one type parameter is not yet resolved. Also unresolved is how the combining step interacts with types whose generic parameters have defaults (such as HashMap, whose hasher parameter is defaulted).
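To illustrate why the multi-parameter case needs care, here is a sketch contrasting two hypothetical combiners. If the "+" in the examples is read as plain addition, the result cannot distinguish `Map<K, V>` from `Map<V, K>`. An order-sensitive combiner, here an FNV-1a-style fold over each argument's position and value, is one possible alternative. Both function names and the choice of FNV-1a are assumptions for illustration. Defaulted parameters are not addressed.

```rust
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

// Continue an FNV-1a style hash from an existing state.
fn fnv1a_extend(mut hash: u64, bytes: &[u8]) -> u64 {
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical combiner: plain wrapping addition. Order-insensitive.
fn combine_additive(base: u64, args: &[u64]) -> u64 {
    args.iter().fold(base, |acc, &arg| acc.wrapping_add(arg))
}

/// Hypothetical combiner: mixes in each argument together with its
/// position, so swapping two arguments changes the byte stream hashed.
fn combine_ordered(base: u64, args: &[u64]) -> u64 {
    let mut hash = base;
    for (index, &arg) in args.iter().enumerate() {
        hash = fnv1a_extend(hash, &(index as u64).to_le_bytes());
        hash = fnv1a_extend(hash, &arg.to_le_bytes());
    }
    hash
}
```

With `combine_additive`, the argument lists `[k, v]` and `[v, k]` always produce the same identifier. With `combine_ordered` they hash different byte sequences.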

This idea was raised here because I think it needs more technical evolution than I could do on my own. I am looking for input as to whether this direction has been investigated to date and, if so, what the status of that discussion is.

What is the "full path to the type"?

I think this does not address the fact that two types can have identical paths when multiple different versions of the same crate are present in the dependency graph.

Neither does it address what happens if two builds do not share the Cargo.lock file and end up pulling the same type from two different minor versions of a crate.

This is not true, as a simple example:

[dependencies]
synv1 = { package = "syn", version = "1" }
synv2 = { package = "syn", version = "2" }

(The --crate-name flag passed to rustc is essentially arbitrary; you could pass the same value for every one of hundreds of dependencies and still get a working build.)

What about hygiene? I can use a macro to have two distinct things with the "same" name in the same module.

Today one can change the path of a type (e.g. move something from std to core) in a non-breaking manner. This would make it a breaking change.

Not every crate is a crates.io crate. Some are in alternative registries and some are not in registries. And even crates.io crates get vendored, patched, sourced from a repo, etc.

This probably has the same challenges/limitations around lifetimes as TypeId does. There are also details that would need to be fleshed out around higher-ranked types and dyn types.

There are a number of potential issues around hashing (collisions, change of algorithm) which can be avoided by just talking about paths directly.

Ok, but are crate identifiers + version + path from crate root to type unique?

Rustc does not know about crate versions. The unique identity of a crate (internally called StableCrateId) is crate name + all -Cmetadata arguments + crate_type==bin + rustc version. Cargo hashes a whole bunch of things into -Cmetadata to ensure crates that cargo believes to be unique don't overlap in ABI. This includes not just the crate version, but also all dependencies as well as the origin of the crate (you can have two crates with the same version from different locations, e.g. one from crates.io and one from a local path). And the rustc version has to be hashed in to ensure you can link together two cdylibs or two staticlibs that contain the same dependency yet were compiled with different rustc versions, without causing symbol conflicts.

And then the path from crate root to type is not unique within a crate. The DefPath also contains disambiguators for each path component. This is necessary to handle things like anonymous types or

fn foo() {
    {
        struct Bar;
    }
    {
        struct Bar;
    }
}

where you would have my_crate[1234]::foo[0]::Bar[0] and my_crate[1234]::foo[0]::Bar[1] as different types.
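The same situation can be observed from stable Rust. The following sketch uses only `std::any::TypeId` and `std::any::type_name`; it returns the identity and the printable name of each block-local `Bar`.

```rust
use std::any::{type_name, TypeId};

// Two structs named `Bar` declared in sibling blocks of one function.
// They are distinct types even though their source-level paths match.
fn foo() -> [(TypeId, &'static str); 2] {
    let first = {
        struct Bar;
        (TypeId::of::<Bar>(), type_name::<Bar>())
    };
    let second = {
        struct Bar;
        (TypeId::of::<Bar>(), type_name::<Bar>())
    };
    [first, second]
}
```

The two `TypeId` values differ. The two name strings may well be identical, since `type_name` is documented as a best-effort description and carries no disambiguator. This is the gap a purely path-based hash would inherit.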

I think you misunderstood #3435. What it needs are identifiers based on the "identity" of a type, such that if the type changes in some non-ABI-compatible way then the identifier will change as well. Your proposal explicitly does not do this.

The complete type path is what Rust already uses today: crate name + modules + type name. Examples are std::vec::Vec and serde::de::Deserialize. To fix the version issue, the hashed crate name would include the major version.

Example: hash("serde@1::de::Deserialize") != hash("serde@2::de::Deserialize"). Only the major version would be used. Builds against serde 1.0.100 and 1.0.200 would produce the same hash because they share a major version. That makes sense because the types are the same and have always been compatible with each other.
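A small sketch of how the versioned key could be built before hashing. The helper name `versioned_path` and the `crate@major::path` string format are assumptions taken from the example above, not an existing Cargo or rustc facility.

```rust
/// Hypothetical helper: build the string that would be hashed, keeping
/// only the major component of a `major.minor.patch` version string.
/// Returns `None` if the major component is not a number.
fn versioned_path(crate_name: &str, version: &str, item_path: &str) -> Option<String> {
    let major: u64 = version.split('.').next()?.parse().ok()?;
    Some(format!("{crate_name}@{major}::{item_path}"))
}
```

With this sketch, versions 1.0.100 and 1.0.200 both yield `serde@1::de::Deserialize`, while 2.0.0 yields `serde@2::de::Deserialize`. One caveat the sketch ignores: Cargo treats 0.x releases as compatible only within the same minor version, so a major-only key would merge 0.1 and 0.2 of a crate.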

If a crate violates semver (breaks) on minor versions, it is a fault of that crate and not a fault with this identifier scheme.

I agree. I made a mistake in my previous answer: a path alone cannot uniquely identify a type when multiple copies of the same crate are present.

As Daniel Fath stated, it is the combination of the crate identifier, the major version, and the path from the crate root to the type that differs. Therefore, all of those pieces of information must be included in the hash.

Here is an example. The hash for the type “syn@1::Type” will be different than the hash for the type “syn@2::Type”.

The previous example with synv1 as an alias for syn and synv2 as an alias for syn will therefore have different hashes because the crate name and version will be different between the two crates.

Thank you for correcting my mistake. You have clarified this issue.

I hadn't thought of that before and you are right. Hygienic macros create different objects that have the same path but are stored as different objects in the compiler.

For the time being, I have no definitive answer. In this case, would the existing hygiene information that the compiler tracks internally be enough to determine which is which in the hash? Or would a completely different method be needed?

Several key points were raised in your comments:

  1. Moving types across crates is a breaking change. You are correct: if a type were moved from std to core, its hash would change. That is a trade-off we would have to weigh against the benefit of stable identifiers. At least a move that changes a type's hash would be an "explicit and detectable" breaking change, not a "silent" one.

  2. Non-crates.io crates. Good point. The identifier system would need to accommodate vendored, patched, and alternative-registry crates. I don’t currently have an answer for this. It is one of the major gaps in the proposal.

  3. Lifetimes, higher-ranked types, and dyn types. Until now, I wasn't aware of the limitations of TypeId. If this proposal inherits the same restrictions, that is a substantial limitation. Could you point me to some information so that I can understand what these restrictions allow and prevent?

  4. Using paths directly instead of hashes would indeed avoid collisions and algorithm changes. However, this proposal aims to use the identifier as part of symbol names for dynamic linking. The symbol name for a type such as std::collections::HashMap could become extremely long if the full path were embedded, whereas a fixed-size hash derived from the path would be much more practical. The purpose of the hash is to identify types stably while enabling dynamic linking, so if there are alternatives to hashing, I would appreciate any insight into them.

Thanks, this is awesome. It's useful and clarifies something for me (I didn't know this before).

My understanding of what you said is that StableCrateId is the unique identity of a crate, covering its origin, version, and dependencies. Also, DefPath and its disambiguators are already an established mechanism for distinguishing types within a crate.

Would combining StableCrateId with DefPath make a practical basis for stable type identifiers? The one real question remaining is whether or not this can be made stable between compiler versions, as opposed to them being implementation details of rustc.

Yes, I was incorrect about what I thought #3435 needed. My proposed solution addresses a different problem than the one #3435 targets: stable identification of a type across different compiler versions and builds, not detection of ABI-incompatible changes. These are two different problems.

Is there any existing work or discussion on stable type identification that is not tied directly to ABI compatibility?

They may have different definitions between minor and patch releases and thus wouldn't be ABI compatible. For example the source may have reordered fields or a field may have been added if it was #[non_exhaustive] or a private field may have changed.

DefPath already contains the StableCrateId. The DefPathHash is the stable identifier used by incremental compilation. It is still sensitive to non-ABI changes though.

Not easily. DefPath components encode a lot of implementation details.

For higher ranked types, you need a way to distinguish between e.g.

fn(&str) // aka for<'a> fn(&'a str)
fn(&'static str)

In addition to fn pointers, dyn types can also be higher-ranked (dyn for<'a> Trait<'a>).
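As a quick check of the distinction above, this sketch asks `std::any::TypeId` for both function pointer types. The two function names are just labels for the example.

```rust
use std::any::TypeId;

// The higher-ranked pointer type: callable with a &str of any lifetime.
fn higher_ranked_id() -> TypeId {
    TypeId::of::<for<'a> fn(&'a str)>()
}

// The pointer type that only accepts a &'static str.
fn static_only_id() -> TypeId {
    TypeId::of::<fn(&'static str)>()
}
```

The two identifiers differ, so any replacement scheme would have to encode the binder and not only the spelled-out path of the argument type.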

For dyn Trait types, you need to decide what their "path" or other unique identifier is (based on the trait, presumably). Then you need to account for every possible dyn Trait + Send, + Sync, + Send + Sync combination somehow. Another detail is that associated types with Sized bounds may or may not be specified.
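The auto-trait combinations can be observed with `TypeId` as well. The sketch below uses `std::fmt::Debug` as an arbitrary example trait; the function names are illustrative.

```rust
use std::any::TypeId;
use std::fmt::Debug;

// Each auto-trait combination on the same principal trait is its own type.
fn dyn_debug_ids() -> [TypeId; 4] {
    [
        TypeId::of::<dyn Debug>(),
        TypeId::of::<dyn Debug + Send>(),
        TypeId::of::<dyn Debug + Sync>(),
        TypeId::of::<dyn Debug + Send + Sync>(),
    ]
}

// Writing the auto traits in a different order names the same type.
fn reordered_id() -> TypeId {
    TypeId::of::<dyn Debug + Sync + Send>()
}
```

All four entries are distinct, while the reordered spelling matches the `Send + Sync` entry. An identifier scheme based on the written form would therefore need to normalize the order of the bounds.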

TypeId does handle the above issues, but it's limited to types which meet a 'static bound. Perhaps this closed tracking issue is a good place to start learning about that limitation. (TypeId also has hashing concerns, but AFAIU they are solvable via changing implementation details.)


All that being said, I personally suspect "it's now a breaking change to move your struct and trait declarations around" is a blocker for the path based concept.

I understand the DefPath issue. I’ll see if I can look into this further.

Thinking about this, I wonder if the solution might be to separate the type’s identity from its layout. Would it be feasible to use a 128-bit identifier split into two parts?

The first part would be for the stable path (the crate name and the type’s path), which wouldn’t change.

The second part would be like a fingerprint of the layout (only the published structure and the types of the fields).

With 128 bits, we’d have enough space—we’d even have some left over for both things without the risk of collisions. This way, the identifier wouldn’t just tell us what type it is, but it would also let us know if its structure has changed. Do you think this kind of approach will help resolve the sensitivity you mentioned regarding DefPath?
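A rough sketch of the split described above, assuming an even 64/64 division. The struct name `SplitId`, its field names, and the even split are all hypothetical; the post does not fix how the 128 bits are divided.

```rust
/// Hypothetical 128-bit identifier: the upper half names the type,
/// the lower half fingerprints its layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SplitId {
    identity: u64, // derived from crate name + path; stable
    layout: u64,   // derived from the published field structure
}

impl SplitId {
    /// Pack both halves into one 128-bit value.
    fn to_u128(self) -> u128 {
        (u128::from(self.identity) << 64) | u128::from(self.layout)
    }

    /// Split a packed value back into its two halves.
    fn from_u128(raw: u128) -> Self {
        SplitId {
            identity: (raw >> 64) as u64,
            layout: raw as u64, // truncation keeps the low 64 bits
        }
    }

    /// Same type, regardless of whether the layout changed.
    fn same_type(self, other: Self) -> bool {
        self.identity == other.identity
    }

    /// Same type and same layout fingerprint.
    fn layout_compatible(self, other: Self) -> bool {
        self == other
    }
}
```

Two values with equal `identity` but different `layout` would report `same_type` as true and `layout_compatible` as false. Note that under this even split each half offers only 64 bits on its own, so the collision resistance of the identity part is that of a 64-bit hash, not a 128-bit one.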

I have reviewed issues #1849 and #129014. It seems to me that there is a clear consensus that the current type_id (based on SipHash-1-3) is not sufficiently collision-resistant, which poses a real security risk (unsoundness).

My proposal to use a 128-bit identifier (or even extend it to 256 bits, as suggested in RalfJung’s thread), divided into a Stable Identity and a Layout Fingerprint, seems to align with the “Possible Solutions” that the compiler team itself is currently discussing. If we implement a scheme based on pointers or a longer hash, we could ultimately have a TypeId that is:

  • Stable across compiler versions (via the Path identity).
  • Secure (using a cryptographic fingerprint such as BLAKE3).
  • Flexible, allowing structures to be moved between modules without breaking compatibility, provided that the compiler always manages the mapping from the path to the identity.

Given that the team is already considering a meeting to design this feature, don’t you think it’s time to reconsider the 64-bit limit and seek the most robust solution for ABI stability in Rust? I am aware that the ABI is intentionally unstable, but I have decided to ask this question.

It seems to me that the first step would be to convince the relevant teams that TypeId should be intertwined with stable type identifiers in the first place.

Thank you for your perspective.

I believe there are two approaches, and both lead to the same conclusion: crABI needs stable type identifiers, and TypeId needs to increase the size of its hash to 128 bits to be collision-resistant.

These aren’t two separate problems; they both address the same concept: identifying types unambiguously across compilations. Implementing them as two separate features would create two sources of truth for the same concept.

The question isn’t whether they should be related, but whether it makes sense for them not to be.