Alternate registry package rebuilds

jonhoo · November 17, 2020, 1:39am

I'm trying to set up an alternate registry that holds private packages. The .crate files for those private packages are built and managed through an external system, which may sometimes re-generate the .crate file even though the crate and its metadata have not changed. But that re-generation also changes the checksum compared to what is in the Cargo.lock, so any subsequent call to cargo will fail with a checksum mismatch.

Unfortunately, as things stand today, it doesn't seem like there's a great way to work around this problem except to blow away the Cargo.lock entirely (which also affects the checksums for crates-io packages), or to parse and patch the Cargo.lock file manually (which is obviously error-prone and a pain).
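For illustration, the manual patching mentioned above could look roughly like the Python sketch below. It assumes the Cargo.lock layout in which each [[package]] block carries its own checksum line. The function name patch_checksums, the sample source string, and the short placeholder checksums are all hypothetical. It rewrites checksums only for packages from one source and leaves crates.io entries untouched.

```python
import re

# Matches the simple `key = "value"` lines found inside a [[package]] block.
FIELD = re.compile(r'^(name|version|source|checksum) = "(.*)"$')


def patch_checksums(lock_text, source, new_checksums):
    """Replace checksums for (name, version) pairs that come from `source`.

    `new_checksums` maps (name, version) to the replacement checksum string.
    Packages from any other source are copied through unchanged.
    """
    header, *blocks = lock_text.split("[[package]]\n")
    patched = [header]
    for block in blocks:
        fields = {}
        for line in block.splitlines():
            match = FIELD.match(line)
            if match:
                fields[match.group(1)] = match.group(2)
        key = (fields.get("name"), fields.get("version"))
        if (
            fields.get("source") == source
            and "checksum" in fields
            and key in new_checksums
        ):
            block = block.replace(
                f'checksum = "{fields["checksum"]}"',
                f'checksum = "{new_checksums[key]}"',
            )
        patched.append(block)
    return "[[package]]\n".join(patched)


# Hypothetical lock file contents; the checksums are placeholders, not real hashes.
SAMPLE_LOCK = '''\
# This file is automatically @generated by Cargo.
[[package]]
name = "libc"
version = "0.2.80"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "aaaa"

[[package]]
name = "rebuilt"
version = "1.0.3"
source = "registry+file:///tmp/alt-index"
checksum = "bbbb"
'''

patched = patch_checksums(
    SAMPLE_LOCK,
    "registry+file:///tmp/alt-index",
    {("rebuilt", "1.0.3"): "cccc"},
)
```

A text-level rewrite like this is fragile in exactly the way described above: it depends on the lock file's formatting and silently skips anything it does not recognize.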

I see two possible ways to remedy this situation, and am curious to hear arguments for and against each one.

The first option is fairly "coarse", but I suspect it applies to many real-world uses of alternate registries. Since I completely control this alternate registry, I trust its build artifacts, and therefore I should be able to tell cargo not to worry about checksums for this registry only. Something like

[registries.alternate-registry]
# ...
verify-checksums = false

I had hoped to be able to approximate this by writing <none> into the checksum field in the alternate registry index, but that hits the other checks for checksum mismatches in cargo.

The second option is to have a fine-grained mechanism for selectively updating the checksum for specific packages. Something like:

$ cargo update --allow-checksum-mismatch \
   --registry alt-registry \
   -p 'rebuilt:1.0.3' -p 'also-rebuilt:2.2.1'

This would set rebuilt and also-rebuilt to version 1.0.3 and 2.2.1 in the registry respectively, and allow the checksum to change if the version stayed the same. Something more concise would be nice, but at the same time this is something we'd want users to be very aware of using, so maybe making it this verbose is a feature?

matklad · November 17, 2020, 9:58am

I think Cargo in general assumes immutability of the registry.

Could you elaborate why rm Cargo.lock is not a good enough solution for this use case?

kornel · November 17, 2020, 12:03pm

Ideally you should fix the crate-generation system. Can you prevent it from overwriting existing files? In Cargo's model there is never a legitimate need to change a .crate file.

Or if it can't avoid losing old crate files and has to rebuild them, can you force it to release the rebuilt crate as a new version (1.0.0-rebuilt-at-xxx) and yank the old one?

Or maybe you could make the rebuilds reproducible and idempotent by ensuring files in the tarball are sorted and have timestamps set to some fixed date.
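A minimal sketch of that idea in Python, assuming the package contents are available as an in-memory mapping of path to bytes. The helper name build_archive and the sample file contents are hypothetical, and this is not how cargo package itself builds archives. Entries are written in sorted order with a fixed timestamp and neutral ownership, and the gzip header timestamp is pinned as well.

```python
import gzip
import hashlib
import io
import tarfile

FIXED_MTIME = 0  # assumption: any constant works; 0 is simply convenient


def build_archive(files):
    """Pack {path: bytes} into a .tar.gz with sorted entries and fixed metadata."""
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # stable entry order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = FIXED_MTIME  # clamp the timestamp
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    gz_buf = io.BytesIO()
    # The gzip header stores its own mtime and file name, so pin both.
    with gzip.GzipFile(filename="", fileobj=gz_buf, mode="wb", mtime=0) as gz:
        gz.write(tar_buf.getvalue())
    return gz_buf.getvalue()


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


# The same files supplied in a different order produce the same archive bytes.
first = build_archive({
    "foo-1.0.0/src/lib.rs": b"pub fn foo() {}\n",
    "foo-1.0.0/Cargo.toml": b"[package]\nname = \"foo\"\n",
})
second = build_archive({
    "foo-1.0.0/Cargo.toml": b"[package]\nname = \"foo\"\n",
    "foo-1.0.0/src/lib.rs": b"pub fn foo() {}\n",
})
```

This only gives stable bytes when the same compression library and settings are used on every rebuild; different deflate implementations can still produce different output for identical input.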

Nemo157 · November 17, 2020, 12:17pm

With some bugfixes on the Cargo side build metadata might be usable for this.

As an example, I did this in this script: at least when downloading from crates.io, deterministically modifying the package content, rebuilding the package with cargo package, and then re-tarring the archive with sorting and mtime clamping, I got deterministic hashes.

bjorn3 · November 17, 2020, 4:07pm

@Nemo157 Is there any reason you manually build the packages instead of using cargo package? Cargo does a bunch of normalizations that you seem to have missed like removing git =.

Nemo157 · November 17, 2020, 4:54pm

I don't build the package manually, I use cargo package to build the package at first, then un-tar and re-tar it to apply sorting and mtime clamping to make the result deterministic. Also I'm starting with a package that was already published through crates.io and then modify it, so it's already had that normalization applied to it.

jonhoo · November 17, 2020, 5:14pm

These crates have dependencies from both crates.io and from the alternate ("mutable") registry. I want to continue to respect the lock file for crates.io, but removing it would ignore the lock file for both registries.

Ah, so, the build process actually generates the alternate registry (it's a local-registry) on each build, which means it does not know about any other versions than the ones it's building. So I can't really yank the old versions. But that may not be necessary, depending on what kind of checking cargo does if it doesn't find the current version in the registry. I could add a suffix like you suggested, though the downside of that is that the Cargo.lock would change with every build. That's not the end of the world, but would be nice to avoid.

Normally that's the approach I'd go for, but I may have oversimplified the problem a bit too much in this instance -- in reality, the build system re-builds on every git commit, and on demand (i.e., when the directory may be dirty), so the source files may in fact also have changed. I know that technically the authors should then also bump the semver patch version, but it seems unrealistic to ask devs to update their patch version on every commit. I could maybe auto-generate the patch number, though that gets into the same concern as above: it requires the lock file to change each time.

Actually, thinking about it now, I really don't know what cargo will do if the version from Cargo.lock simply disappears from the registry (as it will for this alternate registry). It'd be very sad if every generation of the alternate registry had to include every built version ever to function. And would make local development a huge pain, since you'd have to sync your registry frequently.

Nemo157 · November 17, 2020, 5:33pm

This is already going to happen if the checksum changes.

Some very quick testing shows it getting quite confused in some situations. But those are probably bugs that need fixing. In most situations it deals with this correctly by just re-resolving a compatible version. If build metadata was working correctly then I think having the build process do something like insert the checksum into the version's build metadata would work fine.
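One way to picture the build-metadata idea is the Python sketch below. The function name stamp_version, the "sha256." prefix, and the 12-character truncation are assumptions made for illustration and are not an existing Cargo convention. It derives the metadata from the bytes of the generated .crate file, so an unchanged rebuild keeps the same version string.

```python
import hashlib


def stamp_version(version, crate_bytes, digest_len=12):
    """Append a checksum-derived build metadata suffix to a plain version."""
    if "+" in version:
        raise ValueError(f"{version!r} already carries build metadata")
    digest = hashlib.sha256(crate_bytes).hexdigest()[:digest_len]
    # Build metadata identifiers may only contain ASCII alphanumerics and
    # hyphens, separated by dots; a hex digest satisfies that.
    return f"{version}+sha256.{digest}"


stamped = stamp_version("1.0.3", b"contents of rebuilt-1.0.3.crate")
```

With this scheme, a rebuild that changes the archive bytes yields a different full version string rather than the same version with a different checksum.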

jonhoo · November 17, 2020, 5:45pm

My thinking with verify-checksums = false is that cargo would not even write out the checksum (i.e., it would use <none>) for that registry. In which case the Cargo.lock would remain valid.

Hehe, yeah, I'm not surprised.

My thinking with the cargo update -p proposal was to explicitly tell cargo that it should "refresh" a particular version. That might avoid some of the weirder behaviors without relaxing what cargo tolerates too much. But I agree we should probably have it give good errors in these cases regardless.

I'm not sure I follow what you're saying here?

mbrubeck · November 17, 2020, 5:56pm

Semver “build metadata” is an optional string that follows a + sign at the end of the version number. Two versions that are the same except for the build metadata are considered equal for comparison purposes.
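As a small illustration of that comparison rule, here is a Python sketch. The helper names are hypothetical, and it only answers whether two version strings have equal precedence; it does not implement full semver ordering.

```python
def strip_build_metadata(version):
    """Return the version without anything after the first '+'."""
    return version.split("+", 1)[0]


def same_precedence(a, b):
    """True when two versions differ at most in their build metadata."""
    return strip_build_metadata(a) == strip_build_metadata(b)


# Two rebuilds of the same release compare as the same version.
rebuilds_match = same_precedence("1.0.3+build.1", "1.0.3+build.2")
# A different patch version does not.
patch_differs = same_precedence("1.0.3+build.1", "1.0.4+build.1")
```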

Unfortunately, there are bugs in Cargo that show up when using build metadata, for example:

jonhoo · November 17, 2020, 6:06pm

Ah, interesting! That does seem like it could come in very handy here. It does still leave the issue of the Cargo.lock changing, but I suppose I could always reset it after a build to avoid giving the developer who calls build a dirty working directory.

kornel · November 17, 2020, 10:21pm

Registries are not supposed to be used for code during development, but for final deployed snapshots only. For development you should use path dependencies. Your tool, instead of regenerating a registry, could make the required dependencies available in a folder, clone git submodules, etc.

You can use [patch] section to use both registry for releases and paths or git repos for dev.

jonhoo · November 17, 2020, 11:33pm

Yeah, I'm aware of that. Unfortunately, that doesn't really work well if you want network-isolated continuous builds. There isn't anywhere central to publish to or pull from -- your internal/private dependencies are always built from source. It's true that I could inject a list of patch directives in each crate to point them at the output directory of each dependency, but that seems like more of a hack than generating a registry on the fly? I'd also still need to generate the registry index (I think), as otherwise cargo would reject, say, a dependency that says foo = { version = "1.0.0", registry = "internal" }.

kornel · November 18, 2020, 11:15am

[patch] isn't per dependency. In fact, it doesn't even work in dependencies. It's per workspace, and you only use it at the top level.

Create a new workspace, add whatever crates and patches to it that you want, and it will build patched crates. That's way easier than building a whole registry and crate files.

If your build system can get a .crate file, it should also be able to unpack it to a folder. If you can inject registry = "…" to the deps, you can just as well inject path = "…".

Internal throw-away registry is an incorrect use of a registry, and it makes everything more complicated for no benefit. You're fighting the core principle of registries trying to make them do exactly what path deps are for.

est31 · November 18, 2020, 12:29pm

I have done something very similar to this in cargo-local-serve where I deduplicated multiple releases of a crate by hashing each file stored inside a .crate file and then storing those files compressed individually in a kvstore.
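A rough Python sketch of that kind of content-addressed deduplication follows. The names store_crate and make_tar are hypothetical, a plain dict stands in for the kvstore, and this is not the actual cargo-local-serve code; in particular it does not preserve tar headers or compress the stored blobs.

```python
import hashlib
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar from {path: bytes}; used only for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def store_crate(tar_bytes, store):
    """Store each file under its SHA-256 and return a (path, digest) manifest."""
    manifest = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)  # identical files are kept once
            manifest.append((member.name, digest))
    return manifest


store = {}
v1 = store_crate(make_tar({"foo-1.0.0/src/lib.rs": b"v1", "foo-1.0.0/LICENSE": b"MIT"}), store)
v2 = store_crate(make_tar({"foo-1.0.1/src/lib.rs": b"v2", "foo-1.0.1/LICENSE": b"MIT"}), store)
```

Here the two releases contain four files in total, but the shared LICENSE content is stored only once.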

I have tried making the rebuilds reproducible and, as you point out, order and metadata are important, so I preserved the headers. However, the hash that cargo builds is on the entire .tar.gz file (crate files are renamed .tar.gz files) instead of on the .tar file. The tar file is nicely reproducible, but the deflate output depends on the deflate implementation. When I started the project it all used a single zlib implementation, so it was easy for me to rebuild the .crate files. Then, sadly, cargo started using the OS-provided zlib, which does make the cargo binary a bit smaller, but also causes a lot more zlibs to be used for creation... I've wondered about creating wasm builds containing various zlibs but mostly wanted to wait for the wasm ecosystem to mature.

est31 · November 18, 2020, 12:30pm

Ideally, cargo would hash the tar files instead of the .tar.gz files but this has major backwards compatibility issues... all Cargo.lock files out there would have to change.
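To make the distinction concrete, here is a small Python sketch. The placeholder bytes stand in for a real tar stream, and two compression levels stand in for two different zlib implementations; both are assumptions made for illustration. Hashing the decompressed stream is stable across compression settings, whereas the compressed bytes are not guaranteed to match.

```python
import gzip
import hashlib


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def inner_hash(crate_bytes):
    """Hash the decompressed tar stream instead of the .tar.gz bytes."""
    return sha256_hex(gzip.decompress(crate_bytes))


# Placeholder for the uncompressed tar stream of a package.
tar_bytes = b"pretend this is the uncompressed tar stream\n" * 200

# Same input, different compressor settings.
fast = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
best = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

outer_hashes = (sha256_hex(fast), sha256_hex(best))  # what a .tar.gz checksum sees
inner_hashes = (inner_hash(fast), inner_hash(best))  # what a tar-level checksum would see
```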

rpjohnst · November 18, 2020, 5:13pm

Perhaps this sort of "path registry" hybrid could be supported as a feature: Cargo.lock doesn't include a hash for path dependencies, and registries are a convenient way to package up a bunch of crates.

(If you can manage a workspace and [patch] that does seem like it might be a better option, though.)

jonhoo · November 18, 2020, 5:22pm

I think we're talking past each other. I am not injecting anything into the package that's being built. The developers write out their dependencies, and explicitly mark which ones are internal dependencies by giving registry = "internal" for those dependencies. One day I'm hoping for internal to be an actual, real registry, but for the time being it is generated on-the-fly.

I'm aware of how [patch] works. But, it still requires that there is a registry there. If someone writes in their Cargo.toml:

[dependencies]
foo = { version = "1.0", registry = "internal" }

[patch.internal]
foo = { path = "foo/" }

Then cargo (correctly) will complain:

error: failed to parse manifest at `/Users/jongje/foobar/Cargo.toml`

Caused by:
  no index found for registry: `internal`

I really don't want to be in the position where developers have to specify path = for internal dependencies, since they have no way of knowing in advance where the artifacts for each internal dependency are going to end up during a build. And that location will likely depend on context. I could carefully rewrite their Cargo.toml on each build, but that seems like a highly brittle approach.

To me, it seems like registries are exactly the right option here. They provide a standard mechanism to get access to a collection of crates that you can depend on, without being strongly coupled to how those crates are initially produced. The challenge is that, as you observe, cargo assumes that registries are immutable, which is tricky to square with this particular use-case. If it is simply impossible to use registries for this task, that seems unfortunate, since it feels like it's the right solution.

The build metadata approach seems pretty promising, although it does leave the challenge of the Cargo.lock file, which would then change on each build. It also seems like cargo gets pretty confused if the version in the Cargo.lock disappears from the registry (though I'm hoping an explicit cargo update -p might deal with that issue).

ratmice · November 18, 2020, 7:55pm

I am in a somewhat similar boat to Jon. Instead of using path =, I ended up using foo = { git = "git://internal/foo.git", branch = "something" } style dependencies...

The main issue I've run into is having to commit changes to Cargo.toml to change the branch, which get reverted when the branches get merged. That is tedious enough that I've been considering using git-repo to check out and build the worktrees, with repo linking in a top-level Cargo.toml containing a workspace.

With git-repo being responsible for doing all the checkouts and worktree building, I could then revert back to path style dependencies without having to e.g. rsync over the path dependencies in some ad-hoc way... I would have preferred to go with a local registry initially, like Jon proposes here, but was warded off of doing so (in some previous thread). I'm sure using git-repo instead will come with its own set of problems. For instance, it basically takes crates.io, semver, and registries entirely out of the equation, and I'm not sure if having a uniform build process is going to be worth the loss of that.

bascule · November 18, 2020, 9:53pm

As someone who's interested in running an alternate package registry, but strongly prefers crates.io's immutable model, I'm seeing a lot of discussion in this thread that confuses me and I don't quite understand the motivations.

I'd be curious if anyone could concisely recap and summarize any or all of the following:

Why are rebuilds occurring? Why are they helpful?
What problem do rebuilds solve that can't be solved in a better way?
Why aren't they bit-for-bit reproducible?

I will say the zlib nondeterminism is quite interesting to me! Compression and security considerations have a rather storied and contentious past. I wouldn't have expected zlib to be canonical if you asked me that up front, but hearing about it causing nondeterminism in practice is something where I don't think I've fully considered its impact on hashed archives before.

Are there other specific problems like that anyone can call out?
