Silo effect of alternative registries

In discussions about crates.io features/policies, one of the proposed workarounds is to make an alternative registry instead, which could have the desired features.

I’m considering doing that, but the details of how it would work turn out to be pretty messy. Alternative registries are fine for adding a few new private packages on top of crates.io, but a crates.io replacement is very problematic.

Silo of crate deduplication

The biggest obstacle I see is the way cargo deduplicates dependencies. Each crate from each registry is considered to be a completely separate crate. Technically it makes a lot of sense, but IMHO it means that mirrors of crates.io are infeasible.

If I simply mirror crates.io crates, then they won’t pull dependencies from my registry, causing every top-level crate to be a dupe. If I rewrite them to pull dependencies from my registry, it gets even worse.

A user using my mirrored crates with deps from my registry can’t use any other registry or git deps which depend on crates.io crates, even indirectly. As soon as someone installs a crate that “contaminates” the build with deps from crates.io, it will pull a highly problematic parallel universe of non-deduplicated dependencies with it. That’s not just wasteful: for crates with the links attribute (rayon-core, sys crates) or shared types/traits (futures, serde) it breaks compilation.
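To make the problem concrete, here’s a minimal sketch using today’s per-dependency registry syntax; the registry name and the crates.io crate are made up:

```toml
# `mymirror` is the hypothetical alternative registry hosting rewritten copies
# of crates.io crates (it would have to be configured in .cargo/config.toml).
[dependencies]
serde = { version = "1.0", registry = "mymirror" }  # the mirror's serde
web-thing = "1.0"  # hypothetical crates.io crate that itself depends on crates.io's serde
# The build now contains two unrelated `serde`s: a #[derive(Serialize)] generated
# against one of them does not satisfy trait bounds written against the other.
```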

I’m afraid it’s too much to ask from users to switch to an alternative registry as the only registry, and have their builds bloat or break if they accidentally violate that exclusivity.

This is what federation is meant to solve.

1 Like

I’m afraid you’ve misunderstood where the problem is. In a sense, it’s a problem caused by decentralization (even in the minimal capacity currently supported by cargo).

It’s the difficulty of merging crates from multiple registries into a single, well-deduplicated dependency tree. For technical reasons not related to, and not fixable by, federation, it has to support universally shared (serde) and globally unique (rayon-core) crates, even in cases where there could be more than one candidate for being that crate.

I suppose the solution for this could be to change alternative registries to work a bit more like the [patch] feature of cargo, which, instead of adding crates to the pool of available crates, replaces them with alternative versions.
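For comparison, this is roughly what [patch] does today (the fork URL is hypothetical):

```toml
[dependencies]
serde = "1.0"

[patch.crates-io]
# Every crate in the graph that asks crates.io for `serde` gets this source instead,
# so there is still exactly one `serde` in the build.
serde = { git = "https://github.com/example/serde" }
```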

"Deletion" of spam/useless/broken crates is useful of course, but it could be solved without the alternative registry cargo feature, merely by making an alternative front-end that doesn't show them. That's what I'm trying to do with crates.rs.

Apart from that, even an identical read-only view of crates.io runs into the deduplication problem. If you install serde from good-cratesio, your Serialize trait will be considered incompatible with the Serialize trait used by crates from other registries or git.

I guess that for such a mirror you could make "shell" crates that don't contain any code, but only do pub use actual_cratesio_crate::*. This works around the dupes/conflicts and breakage, but the hack is visible to the user. And it's limited to read-only copies, so you can't have a crate-expropriation policy, and it still depends on crates.io, so it's not even a proper mirror.
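A sketch of what such a "shell" crate's manifest might look like, published to the hypothetical mirror registry; its src/lib.rs would contain little more than `pub use serde_upstream::*;`:

```toml
[package]
name = "serde"            # same public name, but living in the mirror registry
version = "1.0.0"

[dependencies]
# The real crate, renamed so the re-export can refer to it unambiguously.
serde_upstream = { package = "serde", version = "=1.0.0" }
```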

Federation involves caching. So no, it’s not /caused/ by federation but rather /solved/ by federation. What cargo supports today isn’t federation, because each instance is completely independent and provides no deduplication and no caching.

Federation makes it so you only talk to one server - your instance - and that server deduplicates across other instances. So transitive dependencies (dependencies of dependencies) would be properly resolved to be from crates.io, and still fetched from your registry. If you wanted to add overrides… hmm, I’m not sure if my federation model allows the registry to add dependency overrides (newer versions of crates under the same namespace+name)…

But in any case, cargo doesn’t support federation.

1 Like

You add a SNAPSHOT or classifier to the dependency to denote that it's your own version. At least that's what Maven offers, and if I've been following you correctly that's the model you're after (and I think it's a good idea).

Yes, but it has signing, so the cached crates would be signed by the originating registry, and that signature would be checked locally.

which means registry-powered SNAPSHOT overrides wouldn’t quite work…

Yes. That's what I meant by a curated view. I'd love to see two or three or ten sites like crates.rs, with different viewpoints on what a high-value crate is. For that purpose — humans finding good crates — I don't see a need for an alternative registry. (Again, this is coming from someone who isn't much concerned about name squatting or crate expropriation.)

I’m strongly in favour of upgrading crates.io to allow reviews and ratings of crates. Independent reviews present feedback to crate owners and instruct new users on the use cases of a crate. A 1 to 5 star rating, possibly with a breakdown of several key areas like issue resolution by crate owners and a security/bugginess score. Having several criteria to filter a crate on should help greatly in crate discovery. It would also be useful to have usage trackers, where the extent and frequency of crate usage in projects is retrieved from users.

The benefit is that the ‘curation’ happens through crowdsourcing, and poor crates can be weeded out more easily because they are not used.

2 Likes

Reviews are a front-end feature, so they don’t affect alternative registries (another website can add reviews without needing to control how dependency resolution works, and doesn’t even have to be a registry).

Alternative registries are brought up because the crates.io team doesn’t have the capacity to build and maintain new features or enforce anti-squatting policies, so I imagine building a reviews system and fighting comment spam and voting brigades is going to be dismissed as a job for another site too.

This thread is about a fundamental distributed-systems problem, network consistency, one which federated systems do not need to solve reliably (since they can just ignore unresponsive instances), but which a dependency resolution system absolutely does (compilation results should not depend on the opaque state of remote caches).

With all due respect, repeatedly pushing for support of the latest hotness (in this case, federation) as some kind of panacea is unhelpful. I understand you're excited about federation, but it is ultimately the wrong shape of distributed gadget.

5 Likes

I’m not excited about federation.

I’m unhappy with the current system, and I’ve seen federation actually work in quite similar situations.

I’ve seen Mastodon instances rise and fall. Mastodon handles it quite well, and the remnants of some of those instances still linger around in some other instances. While it’s a feature of the software rather than the protocol, it seems to be a desirable feature in any federated dependency resolution system.

In any case, I’m not pushing for it because I’m “excited about [it]”; I’m pushing for it because it works, and it solves many of the problems you’re seeing with (non-federated) alternative registries, like (de)duplication issues, having to push to separate registries separately, and so on.

That is anecdata. I counter that I've seen federated systems have problems, like fragmentation. I don't think this is a meaningful line of discussion.

What feature? Mastodon does not solve topological sort problems; it aggregates the output of multiple remote producers. You do not want to live in a world where different producers might disagree on a universal fact (i.e., the contents of a crate), which would make your solution to this graph problem depend on who you talk to at a given time, since crate upgrades are no longer atomic. If you add malicious agents and a sprinkle of social engineering, you've got yourself a proper mess where you can't trust the registries, and now everyone with any competent security team is running their own air-gapped registries.

1 Like

Fragmentation is a feature. And it’s, IMHO, something we need.

I suspect a lot of us feel the exact opposite (i.e., that we need to actively avoid and reduce ecosystem fragmentation), so could you try to articulate why you think this is important? It’s not just because you wanted to “fork Rust”, right?

4 Likes

Please stop. You’re derailing the thread. In this thread I’m trying to discuss an actual issue with a real implementation of a Rust tool.

3 Likes

I disagree — the point is not about having an alternative repository with an identical set of crates, but an alternative repository which can make its own decisions about which crates to allow, who owns which crate, and maybe even publish its own patches.

Say we had an alternative repository, greatcrates.net, which our hypothetical user wants to use. Say we have some dependencies:

  • bloom, only published by greatcrates.net, depending on petal 1.1.5
  • flower, published by crates.io and greatcrates.net, depending on petal 1.1
  • petal, published on both but with version 1.1.5 only available on greatcrates.net

Now, bloom only has a single source, and though flower has two, both have an identical version, so a checksum is enough to deduplicate. But petal has two different versions available, so what does Cargo do? Including both is redundant and may cause issues if the lib has internal state. Using the older version from crates.io is apparently incompatible. Using the newer version from greatcrates.net may be fine, but since flower depended on the original publication on crates.io we can't know this.
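A sketch of the user's manifest in this scenario (the `greatcrates` registry name would have to be configured in .cargo/config.toml; versions are made up):

```toml
[dependencies]
flower = "1.0"                                          # resolved from crates.io
bloom  = { version = "1.0", registry = "greatcrates" }  # only exists on greatcrates.net
# Both want `petal`, but from different sources, so today's Cargo
# treats the two `petal`s as unrelated crates.
```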

It is questionable whether simply enabling an extra repository should automatically pull in newer versions of packages available from that source — especially since a user might enable the repository with the intention to use only a single package. So foo = "0.2" should not simply mean look for foo, version 0.2, in all repositories.
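This is roughly how Cargo's existing, explicit per-dependency registry choice already behaves (the greatcrates index URL is hypothetical):

```toml
# .cargo/config.toml
[registries.greatcrates]
index = "https://greatcrates.net/index"

# Cargo.toml
[dependencies]
foo = "0.2"                                               # always crates.io
petal = { version = "1.1.5", registry = "greatcrates" }   # opt-in for this one dependency
```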

One possibility might be that dependencies are always namespaced by the repository, but that crates may have a provides field listing alternative names for the same lib. Going back to our petal example, greatcrates.net/petal could have provides = ["petal"] (assuming no namespace is required for crates.io), which tells Cargo that using greatcrates.net/petal is a drop-in replacement for crates.io's petal, thus safely de-duplicating dependencies.
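A hypothetical manifest for greatcrates.net's petal under this proposal; note that neither the `provides` key nor this resolution rule exists in Cargo today:

```toml
[package]
name = "petal"
version = "1.1.5"
provides = ["petal"]   # "treat me as a drop-in replacement for crates.io's petal"
```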

(Note that as well as allowing each repo to have its own namespace, repos could also have caches, republishing crates directly from other repos. Ultimately, it must be up to the user which repos to trust.)

2 Likes

That sounds like a badly designed federation…

I think that alternative registries should be able to mirror dependencies from other registries (with a potential renaming), and a published crate must depend only on crates from its own registry (which could be mirrored from another one). The mirrored crate will have at least the following fields: its name, the name and registry it’s mirrored from, and the name and registry of the original source. The latter two will usually be the same, but it could be useful to allow chains of mirrors. As a consequence, re-publishing non-mirrored crates between registries should be heavily discouraged.

In the case of a private company registry, they will manually mirror crates from crates.io (and maybe other registries) after source review. One could imagine that some companies will create private registries with paid access and reviewed crates; maybe they will even provide some liability guarantees. Some registries will mirror updates automatically based on webhooks or periodic checks. Of course it could work the other way as well: crates.io can approve some registries as a source of crates to mirror. Whether this approval would be automatic or manual is up for discussion. Personally I think that ideally such source registries should use crates.io sub-domains.

Let’s say it will be rand.crates.io. Now you publish hc128 to it, which will be automatically mirrored to rand_hc128 on crates.io. So if some crate has both hc128 from rand.crates.io and rand_hc128 from crates.io in its dependency tree, cargo will be able to solve the version constraints and select a single crate version by using the fact that rand_hc128 is a mirror of rand.crates.io’s hc128. The main restriction will be that mirror versions must be the same as the source’s. It’s possible to remove this restriction as well (by making the registry return a list of hashes which fit a given constraint), but I think it’s not worth the additional complexity.
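A sketch of the metadata such a mirrored crate might carry in the index; these fields are part of the proposal, not of Cargo's index format today:

```toml
[package]
name = "rand_hc128"                                          # name on crates.io
mirror-of = { registry = "rand.crates.io", name = "hc128" }  # where it was mirrored from
source = { registry = "rand.crates.io", name = "hc128" }     # original publisher; differs for chained mirrors
```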

So suppose some crate has foo twice in its dependency tree: one foo = "0.2" from crates.io and one foo = "^0.2.1" from altregistry.org, which mirrors foo from crates.io; on crates.io the latest version is 0.2.5, while on altregistry.org it’s 0.2.3. Cargo will request the versions which fit the relevant constraints from both registries (0.2.0 - 0.2.5 for crates.io and 0.2.1 - 0.2.3 for altregistry.org) and will select 0.2.3 as the latest one. But if there is a dependency on foo = "^0.2.4", then cargo will use two foo versions, 0.2.3 and 0.2.5. Not ideal, but I think it’s a reasonable behavior here.
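The two declarations behind this example would look something like this under the proposed scheme (`altregistry` is hypothetical and assumed to be a known mirror of crates.io's foo):

```toml
# Cargo.toml of crate A
[dependencies]
foo = "0.2"                                            # crates.io, latest is 0.2.5

# Cargo.toml of crate B
[dependencies]
foo = { version = "0.2.1", registry = "altregistry" }  # mirror, latest is 0.2.3
# Because the mirror relationship is known, both can unify on 0.2.3.
```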

There is also the question of which registry we should download crates from (note that we already know the source name, source registry, version and hash of the wanted crates). The most logical option is to use the source, but a private company would prefer to use crates from its own registry. I think the latter case is better solved with appropriate Cargo.toml options (e.g. blacklisting all registries except the company-owned one, or giving it the highest priority).
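Cargo's existing source replacement already covers the "only fetch from our own server" case, configured in .cargo/config.toml (the mirror URL here is hypothetical):

```toml
[source.crates-io]
replace-with = "company-mirror"

[source.company-mirror]
registry = "https://crates.example-corp.internal/index"
```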

1 Like

This is a good goal. I'm just worried about a world where surprises can arise due to bad actors and malicious registries, which is something all decentralized systems are vulnerable to. Without consistency, you're going to need to make choices about defaults... and odds are that will be something officially sanctioned, like crates.io.

Right, this sounds like a good world to live in, because we no longer have confusing consistency problems. But I certainly don't want to live in a world where the dependency resolution problem might have different solutions depending on which order you look at registries in.

It's very important to have non-crates.io registries if, for example, you want total auditable control of your code supply chain (which is an extremely reasonable thing to want if you handle sensitive data or don't want to entrust your SLA to a third party).