Verifying that .crate files match the git repository

This is a continuation of the idea discussed at Making crates.io verify code against repository? , which is currently a closed thread.

Inspired by one aspect of the recent XZ backdoor (namely, that the released tarballs didn't match the repository), what would it take for us to provide (opt-in) enforcement of correspondence between the git repository and the .crate file?

Concretely:

  • If your crate declares a git repository,
  • and your publish is based on a committed version, which cargo publish reminds you to do by default unless you bypass its protections,
  • then we should be able to check that the .crate file corresponds precisely to that commit in that git repository, modulo the include/exclude filters in the manifest.
  • We should also be able to check that that commit is reachable from the repository.
  • Ideally, we could verify this before accepting the crate for publication. In addition to serving as a security mechanism, this would also help remind people to push their commits. :slight_smile:
22 Likes

Currently cargo-goggles is also attempting to solve this.

Some complications found when trying to use this:

  • Tag in repo for the release didn't match git commit (usually a honest mistake, but you would want to verify this too).
  • Non-git repo (yes, someone was using hg instead, over at codeberg).
  • Commit ID not reachable from any tag or branch (hm..., though given that it was from the author or ring I'm giving it benefit of doubt, bug has been opened on the crate "untrusted")
  • Cargo.lock not checked in making the packaging non-reproducible
  • Many other varied and interesting failure modes
  • Apparently proc-macros can end up with interesting problems (circular deps in diesel if you consider dev dependencies, so they need to use allow-dirty). See PSA: Check if your Cargo crates are clean and tagged - #3 by weiznich - help - The Rust Programming Language Forum

Not all of these will be relevant to a pre-publish check though.

3 Likes

This was discussed some in https://rust-lang.zulipchat.com/#narrow/stream/318791-t-crates-io/topic/Link.20to.20where.20the.20sources.20of.20a.20crate.20are.20located.20in.20git/near/429799845

including

now Im dreaming of a system that automatically checks that the sources there are actually the same as in the crate file, and adds a little warning icon on crates.io if they are not :slight_smile:

yeah, that is a long-term goal. I think @Walter Pearce or @Adam Harvey (LawnGnome) have already prototyped something like that for eventual inclusion in crates.io

Related is diff.rs and my proposal to add diffing against git](Support diffing against the original repo · Issue #18 · xfbs/diff.rs · GitHub). Would love to see diff.rs integrated into crates.io.

3 Likes

Verification will provide a helpful prompt for honest actors, but dishonest actors could always change the tag after verification has passed.

Which actually raises a wider question: should crates.io periodically reverify that potentially mutable results such as this remain unchanged, eg after some download threshold has been reached?

5 Likes

An entirely different approach could be to remove ability to upload arbitrary tarballs to crates-io. Instead, users would specify a git* repo URL + tag + commit hash, and have crates-io fetch the repo and build a package from it server-side.

If crates.io keeps the tag name + expected hash, then it's easy to check whether this tag still exists and points to the same commit. The packaging should be deterministic, so that others can reproduce the tarball themselves to verify crates-io.

I'm not sure what to do about crates that have autogenerated files. Maybe there could be a pre-publish script, run as WASM, to build these in a deterministic way.

*) and mercurial and pijul or anything that has a verifiable hash, the point is not to limit it to git, the point is to fetch verifiable data out of band.

6 Likes

All of this does mean that submitting to crates.io would require a publicly accessible git repo. What happens if someone uses a different version control system? What happens if the host goes away?

I understand the desire to do more here, and it might be worth it in the end, but I also want to be honest about what it’s giving up / what additional requirements it’s imposing.

9 Likes

The reason for strongly associating releases with a repo is to match users' expectations/workflows. Typically people collaborate on crates and review source code of the repo. Releases typically are tagged in the repo. But for security it matters what is in the tarball, and that ends up having little visibility, is rarely reviewed, and to many users it's surprising that such discrepancy is even allowed.

It's possible to support popular ones.

Then the problem of users assuming the git repo contains the crate source goes away :wink:

Note that crate tarballs are public too, and the repo won't need to contain anything more than the releases.

2 Likes

While git has largely won, not quite everyone uses it (I came across one dependency of mine using Mercurial). Also there are a few new VCSes on the rise (pijul and jujitsu comes to mind).

It seems like a bad idea to entrench a specific VCS to the point where you can't use anything else.

3 Likes

If there are features we can get by depending on git, let's depend on git. If another version control system arises, we can add support for that one. Let's not least-common-denominator ourselves into not shipping something useful for people.

I'm not suggesting that there are enough benefits of crates.io requiring git to be worth the transition cost at this point, but I don't think "other version control systems exist" is a sufficient reason to avoid adding useful features for crates using git.

3 Likes

crates.io should link to the specific commit by hash, and if people want to check out the exact source corresponding to the crate, they can check out that commit by hash.

2 Likes

Worth checking as well, but I think the primary check would be comparing the git commit to the crate file. And if we could enforce it on the crates.io side (with an opt-in from crate maintainers), before accepting the crate, then that would stop happening.

This would be an opt-in by crate maintainers, so the initial version could just support git repositories.

This could also be checked automatically.

1 Like

@kornel was suggesting to remove support for non-git repos entirely, which @Vorpal was pushing back against. We can support extra features for when git is used as vcs like having crates.io check that the uploaded crate matches a certain git commit and show the result of the check if a git commit is provided, but IMO we should not deny non-git repos entirely.

6 Likes

The git-vs-hg is a distraction here. The same concept can be implemented for multiple VCSes. Which ones exactly deserve to be implemented can be bikeshedded later in RFCs and crates-io PRs.

2 Likes

Every VCS would need to be hard coded into both crates.io (for verification) and cargo (for getting the repo and commit hash). I think it is highly unlikely that both will accept PR's to support every VCS that anyone might want to use (git, mercurial, jujutsu, pijul, fossil, cvs, subversion, plastic scm, bazaar, ...[1]) As such I don't think this should be mandatory. That way we can implement support for just a couple popular VCSes without causing problems for anyone who uses a less popular VCS.


  1. List of version-control software - Wikipedia has a whole lot more VCSes ↩︎

10 Likes

Focusing on just git vs hg is the actual distraction here. As @bjorn3 pointed out, there is a ton of them. And even in the unlikely event that every one that anyone still use is implemented, what about the next up and coming VCS, written in Rust, and they decided it is time to self host?

It is not up to Cargo and crates.io to enforce the git monopoly, nor the git/hg-duopoly, etc. It is bad enough that the authentication system is based on github currently (with no plans for a backup or how to migrate should there be an urgent need).

That said, I believe that Cargo should have this as the default, with an --allow-unknown-vcs flag. And that usage of such flag should show up on crates.io indicating that "hey, there is something going on that you might need to take a look at". This would also take the pressure off from having to implement support for a ton of different VCSes.

The "repository" field should be mandatory going forward after some point as well, since all VCSes I know of have a concept of a URL or similar for accessing the repo (I have a vague memory that it might not have been an URL back with CVS, but I cannot for the life of me remember the details of how to check out code from CVS, I have only done it a few times). Also CVS is an actual real distraction, because no on should still be using it.

5 Likes

As mentioned in this post in the original thread mentioned at the start here, there may be legitimate reasons for diffs between the .crate and VCS checkout. This includes:

  • Git's export-ignore attribute that leaves files out of its archives
  • Git's export-subst that does placeholder expansion when inserting content into the archive (note that things like short hashes can change over time as the default short hash length grows as the repository does or collisions appear)
  • I believe cargo itself can ignore files either by configuration or behavior (Cargo.lock for libraries?)
  • VCS filters like git-lfs or git-annex being present and active for a given checkout (I only install LFS hooks per-repo as needed…I hate being surprised with large file downloads on clone)
  • Submodules probably throw wrenches around here because why not

I can think of ways this can DoS crates.io (hi there, here's Chromium's repo…).

6 Likes

There's also an interesting subcase for workspaces. Verification of matching the VCS source tree is "simple" if the package root is the VCS root, but it's a bit more complicated if the package root is a nonroot path in VCS. It's likely mostly straightforward by just looking up the package root in the VCS workspace, but it's still extra complexity to handle.

Namely, I think handling workspace subpackages would easily fall out of handling include/exclude and allowing the .crate package to subset the VCS tree, although parent-traversing relative path fixup is always interesting to compensate for.

The somewhat unfortunate bit is that in order to be of any actual use for audits, the verification needs to be done by the registry (or the audit tooling), not the publisher, as a malicious publish can always just sign the wrong package as being legitimate. The "don't publish dirty" warning is about all that can be done on that side … although adding a "is pushed to remote" check to the existing "is local state dirty" check for supported VCS wouldn't be too impactful an ask, imo, so long as it remains a "whoops did you actually mean that" hint rather than enforced.

4 Likes

There's also the new feature to specify dependencies in the workspace Cargo.toml and inherit in the individual packages. File-based subsetting would leave workspace members mentioned in a file but non-present in the .crate. Would cargo accept such a situation?

Packaged crates already have both an adjusted Cargo.toml which bakes in any fixups for dependency inheritance, remapped filepaths, and path dependencies, and Cargo.toml.orig which contains the original pre-fixup manifest.

1 Like

An alternate solution is to acknowledge the current reality that for security it matters what is in the tarball, so give that more visibility and workflows for reviewing, such as diff.rs.

3 Likes