Alternate registry package rebuilds

I haven't bothered to learn the details, but this is something that Debian package maintainers have had to deal with.

What confuses me is that you've mentioned crates from the "registry" can change and even be in a git-dirty state. That sounds like these dependencies are not merely used as some immutable published artifact, but they're being actively edited. That's a use-case for a path dependency or a patch.

The typical workflow is like this. You start with:

[package]
name = "parent"

[dependencies]
foo = { version = "1.0", registry = "internal" }
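
(For completeness, a registry name like "internal" has to be defined somewhere in .cargo/config.toml; a minimal sketch, with a made-up index location:

[registries.internal]
index = "file:///some/path/to/the/internal/index"

The exact URL doesn't matter for the workflow below, only that the name resolves to a registry.)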

When you decide to work on both parent and foo projects at the same time:

  1. Clone foo
  2. Add [patch.internal] foo = { path = "./foo" } to parent's Cargo.toml (see the sketch below)
  3. Edit both projects
  4. When done, do cargo publish in ./foo, and in parent remove [patch]
  5. Update dep version in parent, or if it's compatible, do cargo update -p foo
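
For example (a minimal sketch of steps 1 and 2, with ./foo being wherever you cloned it), parent's Cargo.toml temporarily looks like:

[dependencies]
foo = { version = "1.0", registry = "internal" }

[patch.internal]
foo = { path = "./foo" }

While the [patch] is in place, cargo resolves foo to the local checkout but keeps the version requirement, which is why step 4 is just deleting that section again once foo is published.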

I also need to work with internal dependencies and builds that run offline(-ish). I solve that by using git submodules and path = "./submodule" deps. This way a recursive checkout is complete and builds without further network access. Git gives a consistency guarantee for each commit. But I get that git submodules are evil back-stabbing bastards, so it's not a solution for everyone.
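
A sketch of what that looks like (the URL and the vendor/foo location are made up):

# after: git submodule add https://git.example.com/foo.git vendor/foo
[dependencies]
foo = { path = "vendor/foo" }

A recursive clone then contains everything needed to build, pinned to whatever commit the submodule records.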

2 Likes

Hehe, sorry, I may have tried to over-simplify the problem to the point where important context was lost. So let me try to give a better overview.

I'm working on a build system that allows both external dependencies (e.g., from crates.io) and internal dependencies (which are managed entirely by this build system). External dependencies come from a mirror of crates.io. Internal dependencies are not published to or pulled from a registry. Instead, the build system builds internal dependencies before the "current" crate, and then makes them available to the current crate's build in a manner of my choosing.

The build system is used in two contexts: to build as part of a deployment pipeline, and to build on developers' local machines during regular development. In the latter case, they may or may not be modifying their dependencies. In all cases, the first build always pulls in things from source and builds from there.

Two related challenges arise: how do internal dependencies get exposed to a crate, and how do we handle the fact that there is no central "artifact" repository (and so each build is "from scratch").

For the first, I'd like the experience to be as simple as possible for the developers. They simply specify

[dependencies]
foo = { version = "1.0", registry = "internal" }

in their Cargo.toml, and the build system makes sure those internal dependencies are made available in a way that makes that declaration work. This is super nice for the developer -- there's just that one familiar-looking line. However, behind the scenes, this gets complicated by the second challenge. Since everything is built anew and independently in each build environment, the registry is not immutable. Specifically:

First, the exact same source built on two different machines produces different .crate files. This causes checked-in Cargo.lock files to become a problem, since the checksums won't match. This could maybe be remedied with deterministic tarballing.

Second, since the build system pulls down the latest revision of each internal dependency (i.e., the latest commit), the source files may change without the version number of that internal crate having been changed. Maybe the build system could automatically bump the patch version for each crate, but that seems pretty hacky. It's also difficult for it to do so in a reliable way -- how would it know what patch version to assign such that it is larger than what's in the Cargo.lock? I suppose it could parse Cargo.lock, but that seems painful.

Third, developers can have workspaces of related crates that are all built and released together. These are not cargo workspaces, but rather collections of closely related packages in the build system. In general, it's likely that developers will modify such related crates in tandem, and it'd be unfortunate if they had to specify a patch for each such related package every time they want to do development. It's also not clear what path they would set, since that will depend on their own workspace layout, and won't easily map to the directory layout other developers have in their setups.

Together, these concerns are what led me to the initial question of how I can make cargo simply accept a given registry as mutable. It solves all three issues above while making the developer experience very smooth. Maybe the answer is "you can't" for some much deeper reason in cargo, but it seems like the primary restriction is the checksum check. If that is indeed all that stands in the way, it'd be nice to find a workaround. While it is true that cargo today assumes that registries are immutable, there's a question of "is that a necessary restriction for all registries?" And if so, why?

I'll add that I have little ability to modify how the build system at large works. It's a mature beast with many backwards-compatibility constraints. In particular, supporting a centralized registry for internal dependencies is a no-go, because (for security reasons) the build is entirely network isolated.


So, to the specific questions that came up:

I hope I mostly gave the necessary context to these questions above. Essentially, they occur because there is no actual central repository for internal packages -- each build is from source, and the source may differ without the version number changing.

I hope the above gives some insight into why I don't want to do this with submodules. Basically, developers shouldn't have to declare all their internal dependencies as git submodules; they should be going through the build system. But, going through the build system, it's not clear how they would even specify a path for a patch. Whereas using registries makes the experience very nice (but faces issues since the registry isn't a "real" registry).

What about your build system creating a local directory next to the target directory at the top of the repo (or wherever else you want it to be)? Your build tool can scan the Cargo.tomls in the workspace and populate the directory with symlinks to a git checkout that the build tool creates. Overriding a dependency with a local checkout can also happen through your build tool. Unprivileged users of all major OSs can create symlinks nowadays, including on Windows 10.

Or alternatively, just go via the lockfile parsing route. There are crates out there that can do that.

2 Likes

The answer is "you can't", because by design registries are supposed to be external, permanent, and immutable. Your thing is local, temporary, and mutable. You're trying to make lock files not lock the version, and not verify integrity of the crates. That goes directly against the goal of lock files.

The registry = "internal" is fake familiarity. It's not really a registry, because you don't have published immutable snapshots of crates, and the dependencies don't have the locking and updating behavior of a registry. You have path dependencies with extra steps.

Your registry-generator would be better off putting dependencies in some local folder. The devs should be using path dependencies. This is the feature designed for your situation. The lock file will accept path deps changing and won't have a checksum for them.
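
A minimal sketch of that, assuming the generator populates a deps/ folder next to the crate (the path is hypothetical):

[dependencies]
# the build tool checks out or generates the crate source here
foo = { path = "deps/foo", version = "1.0" }

The version requirement is optional next to path, but if you keep it, cargo checks that the crate found at the path actually satisfies it.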

Also, if the ultimate source of your deps is coming from git, consider using git deps. They also work natively with Cargo, and can be gracefully locked to a specific commit. If you need some local building-and-serving machinery for them, consider changing your registry to be a local git server.
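
A sketch of that variant, with a made-up URL and commit:

[dependencies]
foo = { git = "https://git.example.internal/foo.git", rev = "9f2c1ab" }

The rev (or the commit recorded in Cargo.lock, if you leave rev off) pins the dependency to an exact commit, which gets you most of the immutability guarantees back without running a registry.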

1 Like

FWIW, I'm not a big fan of the idea of verify-checksums = false, and would like to offer a counter-proposal. Cargo already has a precedent for build = false as well as build = "build.rs", so values are already dynamically typed. Instead of a per-registry setting for verifying checksums, it would seem to me that a more flexible way would be to allow version to be specified as false.

foo = {version = false, registry="internal"}

with some registry flexibility on how the registry responds to false, e.g. crates.io could respond with no version at all, or the latest package version, while Jon's registry could return the latest commit. With the stipulation that if a registry does return a version, it doesn't check the checksum...

Anyhow, the feature makes more sense to me at this per-crate level than per-registry. This doesn't do anything about tarball reproducibility of should-be-identical packages (if that is a problem?)

Yeah, I think that's probably the path I'll end up going with. It feels really unfortunate to have developers specify path dependencies to a "magical" path that the build system will generate for them, but it seems like it is the best way forward without cargo gaining support for mutable registries.


I wonder whether there may be a way to extend cargo so that it has support for "a folder of path deps" (ideally in .cargo/config.toml) so that the dependencies can avoid specifying an actual path and use a name instead. Something like having Cargo.toml say (names should be bikeshedded):

foo = { from_path_dir = "internal" }

and with .cargo/config.toml having:

[paths.internal]
dir = "set/by/the/build/system"

from_path_dir would basically be an instruction to look up the path for the corresponding name in paths, and then treat it as a path = {{path.dir}}/{{crate}}. This exact mechanism might be clunky, but I'm wondering if something that allows aliases for paths might help significantly with the user experience.

3 Likes

It might be possible to add a feature to cargo so that, if the target of a path dependency is a workspace, it takes the crate with the specified name from that workspace. This already works for git dependencies, so it would only make sense to extend it to path dependencies pointing at workspaces.

2 Likes

That would be neat, though I'd worry about the generality of such a solution since workspaces don't always "stack" nicely. If I generate a workspace to host a collection of path dependencies, but one of those path dependencies itself is a workspace, my guess would be that cargo would be confused?

Having just reread this thread, I am confused by the situation. You say that you have a mix of internal and crates-io packages. And as for why you can't have an immutable registry for your internal packages, you said:

Unfortunately, that doesn't really work well if you want network-isolated continuous builds. There isn't anywhere central to publish to or pull from ...

So how do your builds get access to crates-io packages without network access? Why can't whatever server is trusted to run the mirror of crates-io also run a real internal registry?

So, the mirror of crates-io isn't actually a network mirror, it is a local clone of a git repository that is set up outside the build jail. The downloads of individual .crate files are accessible through a particular read-only asset cache available inside the build jail. The environment is very tightly controlled, so allowing more network access from inside of it is unlikely to happen.

Even if it were possible to have a shared network mirror for internal dependencies, it's not clear how that would work in practice. Does every build of every internal package, including dirty builds, go to the mirror? How are version numbers assigned if so (not every commit changes the version number)? It also raises the question of "why do I have to go via the network to pull in a crate that I already have the source for locally?"

What type of local registry are you using? Have you tried using a directory registry? That would let you use an unpacked directory, rather than a .crate file; it also gives you full control over what crate checksum you say the directory has, if any. You should never use that for modified sources, but if you have effectively unmodified crates and the only issue is that you can't reproduce the .crate files, a directory registry would sidestep that.
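
A rough sketch of what that could look like in the generated .cargo/config.toml, using source replacement (all names and paths here are hypothetical):

[source.internal]
registry = "file:///index/generated/by/the/build-system"
replace-with = "internal-dir"

# each crate subdirectory under this directory carries a .cargo-checksum.json
# that the build system writes out itself
[source.internal-dir]
directory = "/crates/unpacked/by/the/build-system"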

1 Like

The local registry is currently a local-registry. It could be a directory instead, and I could point it directly at the source directory of each internal crate, but I'd be worried about that causing issues down the line as we'd then not be taking into account things like excludes from Cargo.toml (I think that's done by cargo package, though I could be wrong). Ultimately though, I don't think it solves the issue since the source can change, which is what makes me think that if cargo truly has to require registries to be immutable, going with some kind of "group of path dependencies" solution may be the path forward.
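
(For reference, by excludes I mean the package-level globs in the crate's own Cargo.toml, something like this, with made-up patterns:

[package]
name = "foo"
version = "1.0.0"
# files matching these globs are left out of the packaged crate by `cargo package`
exclude = ["benches/data/*", "*.snapshot"]

Pointing a directory source at the raw source tree would skip that filtering step.)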

If you want to respect excludes and similar, just unpack the .crate file to make the directory.

Your build process is creating the registry either way; this just removes the checksum mismatch issue you currently have. And this allows you to avoid specifying a precise path to each crate.

Yup, compressing + extracting + using directory would indeed avoid the checksum issue. I guess what I'm asking more fundamentally is whether that is an acceptable way to go about this. It sounds from the other replies as though cargo really doesn't want you to have things in a registry that might change, and that you need to use path for this. My worry is that there is some internal cargo invariant that I might violate by trying to do this through a registry.

My worry is that there is some internal cargo invariant that I might violate by trying to do this through a registry.

Cargo is opinionated. It is intentionally designed to foster a healthy and open ecosystem, at the cost of some use cases. (According to some of the oldest design documents for Cargo I have access to.) If you have a lockfile, there is nothing anyone else can do to break your build. That implies that registries are immutable. That implies that publishing a new version is a deliberate and intentional act. The ability to maintain a fork of a dependency is given intentional semantic salt, so as to encourage you to get your changes upstreamed. But not too much salt; it wants you to fix problems and not just work around them. Overall it has worked amazingly well at forcing the ecosystem to be basically a good one. ... now what was my point again? Right.

Some of the principles of Cargo that build a good open community may not be needed for a closed community. It may be that "semver" and deliberate publishing are not needed when you know all the people that are using your library. It may be that one can have a healthy closed community without the property "no one can break my build nor force me to update", if the one forcing the breaking change also needs to make a PR to fix my project. So maybe tooling that ignores lockfiles is correct for that community.

There are definitely places where Cargo makes non-standard use difficult unintentionally. Nested workspaces do not work, because we have not gotten around to implementing them yet. (If that is part of what you need, we would love to see that fixed.) Other places are difficult by design: a lockfile is there to stop you from building if things changed without your consent. You are blazing new ground, and that will require a lot of back and forth to help us figure out what is intentional and what is accidental. This will involve trying things that we think may work, finding bugs (and having to re-justify why you are doing non-open things), and then either fixing the bugs or having to try something else. I am sorry, but we don't yet have answers to what works and what does not, just guesses. Projects that came before you (like https://cargo-raze and buck: rust_library) have IMO gotten things working well enough for their needs and not continued to the hard conversations. Hopefully with your help we will have better answers for the next person to come along.

6 Likes

That all makes a lot of sense! And just to be clear, I think many of the decisions made in cargo are completely reasonable, and exert the "right" kind of impetus on the ecosystem. That's why I'd like to keep things as close to the way things are now, but with an escape hatch of sorts for environments where those same constraints do not quite apply. And as it would be an escape hatch, it's fine for that hatch to also be "hard to open" as it were.

I think what'd be particularly helpful for me is to understand whether or not cargo requires a registry to be immutable. If, for example, I set up a registry that was built locally on the fly during build using the directory registry index type, and then something changes in a crate in that registry, and then I rebuild, will cargo run into problems? And if it does today, should that be considered an incorrect use of registries, or would not handling that case be considered an error that should be fixed? In some sense, I'm trying to ask what the desired outcome is, because if the consensus is "registries should always be immutable, and if you mutate it, that's cargo-level undefined behavior", then I should obviously avoid using registries for this. If, on the other hand, the consensus is "there's no clear way to do it, but if you manage, then go for it" then I'd be more inclined to try and make it work and fix issues that come up (since I'd consider them upstream bugs).

More concretely, if we land on "mutable registries is a bug", then I'd be more inclined to go the "how do we group path dependencies" route, whereas for "mutable registries are acceptable" I'd probably work towards something like the directory solution or some way to skip checksumming for a "trusted"/"managed" registry.

I think the way I'd put it is this: if you're actually trying to change upstream packages, if you want to maintain forks of packages from crates.io, then you shouldn't pretend that those packages are the same as the ones from crates.io. But if you are attempting to use the same packages as upstream crates.io, and you're just working around logistical issues with package/source distribution by using directory registries, I don't think that's necessarily a problem. Just don't use that to fork packages; if you actually want to fork packages, use a path or a patch entry.

1 Like

Ah, none of this is modifying packages from crates.io. This is about internal packages that are not managed through a centralized registry, but are instead pulled in through a custom build system that then wants to make them available to the current cargo crate.

The current main use of Directory Sources is for vendored dependencies. In that use case, any git pull can mutate what is in the source. In some sense, your tool is just automatically vendoring a project's dependencies. Given that, I think it is a Cargo bug if it gives you trouble. But that is my opinion, I am not speaking for the team, and there is no precedent to point to. Welcome to new ground.

5 Likes