Inspired by one aspect of the recent XZ backdoor (namely, that the released tarballs didn't match the repository), what would it take for us to provide (opt-in) enforcement of correspondence between the git repository and the .crate file?
Concretely:
If your crate declares a git repository,
and your publish is based on a committed version, which cargo publish reminds you to do by default unless you bypass its protections,
then we should be able to check that the .crate file corresponds precisely to that commit in that git repository, modulo the include/exclude filters in the manifest.
We should also be able to check that that commit is reachable from the repository.
Ideally, we could verify this before accepting the crate for publication. In addition to serving as a security mechanism, this would also help remind people to push their commits.
Currently cargo-goggles is also attempting to solve this.
Some complications found when trying to use this:
Tag in repo for the release didn't match git commit (usually a honest mistake, but you would want to verify this too).
Non-git repo (yes, someone was using hg instead, over at codeberg).
Commit ID not reachable from any tag or branch (hm..., though given that it was from the author or ring I'm giving it benefit of doubt, bug has been opened on the crate "untrusted")
Cargo.lock not checked in making the packaging non-reproducible
now Im dreaming of a system that automatically checks that the sources there are actually the same as in the crate file, and adds a little warning icon on crates.io if they are not
yeah, that is a long-term goal. I think @Walter Pearce or @Adam Harvey (LawnGnome) have already prototyped something like that for eventual inclusion in crates.io
Verification will provide a helpful prompt for honest actors, but dishonest actors could always change the tag after verification has passed.
Which actually raises a wider question: should crates.io periodically reverify that potentially mutable results such as this remain unchanged, eg after some download threshold has been reached?
An entirely different approach could be to remove ability to upload arbitrary tarballs to crates-io. Instead, users would specify a git* repo URL + tag + commit hash, and have crates-io fetch the repo and build a package from it server-side.
If crates.io keeps the tag name + expected hash, then it's easy to check whether this tag still exists and points to the same commit. The packaging should be deterministic, so that others can reproduce the tarball themselves to verify crates-io.
I'm not sure what to do about crates that have autogenerated files. Maybe there could be a pre-publish script, run as WASM, to build these in a deterministic way.
*) and mercurial and pijul or anything that has a verifiable hash, the point is not to limit it to git, the point is to fetch verifiable data out of band.
All of this does mean that submitting to crates.io would require a publicly accessible git repo. What happens if someone uses a different version control system? What happens if the host goes away?
I understand the desire to do more here, and it might be worth it in the end, but I also want to be honest about what it’s giving up / what additional requirements it’s imposing.
The reason for strongly associating releases with a repo is to match users' expectations/workflows. Typically people collaborate on crates and review source code of the repo. Releases typically are tagged in the repo. But for security it matters what is in the tarball, and that ends up having little visibility, is rarely reviewed, and to many users it's surprising that such discrepancy is even allowed.
It's possible to support popular ones.
Then the problem of users assuming the git repo contains the crate source goes away
Note that crate tarballs are public too, and the repo won't need to contain anything more than the releases.
While git has largely won, not quite everyone uses it (I came across one dependency of mine using Mercurial). Also there are a few new VCSes on the rise (pijul and jujitsu comes to mind).
It seems like a bad idea to entrench a specific VCS to the point where you can't use anything else.
If there are features we can get by depending on git, let's depend on git. If another version control system arises, we can add support for that one. Let's not least-common-denominator ourselves into not shipping something useful for people.
I'm not suggesting that there are enough benefits of crates.io requiring git to be worth the transition cost at this point, but I don't think "other version control systems exist" is a sufficient reason to avoid adding useful features for crates using git.
crates.io should link to the specific commit by hash, and if people want to check out the exact source corresponding to the crate, they can check out that commit by hash.
Worth checking as well, but I think the primary check would be comparing the git commit to the crate file. And if we could enforce it on the crates.io side (with an opt-in from crate maintainers), before accepting the crate, then that would stop happening.
This would be an opt-in by crate maintainers, so the initial version could just support git repositories.
@kornel was suggesting to remove support for non-git repos entirely, which @Vorpal was pushing back against. We can support extra features for when git is used as vcs like having crates.io check that the uploaded crate matches a certain git commit and show the result of the check if a git commit is provided, but IMO we should not deny non-git repos entirely.
The git-vs-hg is a distraction here. The same concept can be implemented for multiple VCSes. Which ones exactly deserve to be implemented can be bikeshedded later in RFCs and crates-io PRs.
Every VCS would need to be hard coded into both crates.io (for verification) and cargo (for getting the repo and commit hash). I think it is highly unlikely that both will accept PR's to support every VCS that anyone might want to use (git, mercurial, jujutsu, pijul, fossil, cvs, subversion, plastic scm, bazaar, ...[1]) As such I don't think this should be mandatory. That way we can implement support for just a couple popular VCSes without causing problems for anyone who uses a less popular VCS.
Focusing on just git vs hg is the actual distraction here. As @bjorn3 pointed out, there is a ton of them. And even in the unlikely event that every one that anyone still use is implemented, what about the next up and coming VCS, written in Rust, and they decided it is time to self host?
It is not up to Cargo and crates.io to enforce the git monopoly, nor the git/hg-duopoly, etc. It is bad enough that the authentication system is based on github currently (with no plans for a backup or how to migrate should there be an urgent need).
That said, I believe that Cargo should have this as the default, with an --allow-unknown-vcs flag. And that usage of such flag should show up on crates.io indicating that "hey, there is something going on that you might need to take a look at". This would also take the pressure off from having to implement support for a ton of different VCSes.
The "repository" field should be mandatory going forward after some point as well, since all VCSes I know of have a concept of a URL or similar for accessing the repo (I have a vague memory that it might not have been an URL back with CVS, but I cannot for the life of me remember the details of how to check out code from CVS, I have only done it a few times). Also CVS is an actual real distraction, because no on should still be using it.
As mentioned in this post in the original thread mentioned at the start here, there may be legitimate reasons for diffs between the .crate and VCS checkout. This includes:
Git's export-ignore attribute that leaves files out of its archives
Git's export-subst that does placeholder expansion when inserting content into the archive (note that things like short hashes can change over time as the default short hash length grows as the repository does or collisions appear)
I believe cargo itself can ignore files either by configuration or behavior (Cargo.lock for libraries?)
VCS filters like git-lfs or git-annex being present and active for a given checkout (I only install LFS hooks per-repo as needed…I hate being surprised with large file downloads on clone)
Submodules probably throw wrenches around here because why not
I can think of ways this can DoS crates.io (hi there, here's Chromium's repo…).
There's also an interesting subcase for workspaces. Verification of matching the VCS source tree is "simple" if the package root is the VCS root, but it's a bit more complicated if the package root is a nonroot path in VCS. It's likely mostly straightforward by just looking up the package root in the VCS workspace, but it's still extra complexity to handle.
Namely, I think handling workspace subpackages would easily fall out of handling include/exclude and allowing the .crate package to subset the VCS tree, although parent-traversing relative path fixup is always interesting to compensate for.
The somewhat unfortunate bit is that in order to be of any actual use for audits, the verification needs to be done by the registry (or the audit tooling), not the publisher, as a malicious publish can always just sign the wrong package as being legitimate. The "don't publish dirty" warning is about all that can be done on that side … although adding a "is pushed to remote" check to the existing "is local state dirty" check for supported VCS wouldn't be too impactful an ask, imo, so long as it remains a "whoops did you actually mean that" hint rather than enforced.
There's also the new feature to specify dependencies in the workspace Cargo.toml and inherit in the individual packages. File-based subsetting would leave workspace members mentioned in a file but non-present in the .crate. Would cargo accept such a situation?
Packaged crates already have both an adjusted Cargo.toml which bakes in any fixups for dependency inheritance, remapped filepaths, and path dependencies, and Cargo.toml.orig which contains the original pre-fixup manifest.
An alternate solution is to acknowledge the current reality that for security it matters what is in the tarball, so give that more visibility and workflows for reviewing, such as diff.rs.