Verifying that .crate files match the git repository

Common problems from my scanning of crates so far:

  • More than 27,000 crates lack the repository property.

  • The second most common problem is the lack of a sha1 in the .crate file. See rust-lang/cargo issue #13695, "Include cargo_vcs_info.json even for dirty working directories".

  • Forks and crates that have moved to another repo tend to leave the outdated original repo URL in Cargo.toml.

  • Unfortunately, sometimes the sha1 of the published package is not in the repo (perhaps it was published from a release branch that didn't get pushed, or the commit got amended/rebased).

  • Getting the right commit from the git tags doesn't always work. The tags may be off by a commit or two. Monorepos make finding the right tag harder, because sometimes tags are only for a primary crate, not helper crates. Some use custom tag naming schemes, and especially prerelease and + suffixes are hard to match.

  • When there's no sha1 and no tags, it's very tricky to find the right commit, because there are many commits with the same version in Cargo.toml, and it's unclear whether the first, the last, or one in between is the published one. I've added fuzzy matching that looks for the most similar commit, but it still fails often.

  • Cargo.lock is most often missing in the repo (fortunately it's not that dangerous).

  • README and LICENSE can be missing or moved due to the readme = "../../README" pattern.

  • git submodules, symlinks, and workspaces are annoying. There are lots of edge cases (there is a crate that has a git symlink pointing to a submodule, which has a .git link-file in the tarball).

  • Cargo does quite a lot of fixups of Cargo.toml, and they've evolved over time, so sometimes it's hard to verify whether the trimmed-down Cargo.toml is an accurate transformation of the original.

  • There's a bunch of crates with buggy tarballs that have duplicate files, or only work on case-insensitive file systems (e.g. have cargo.lock).

  • git performs automagic CRLF normalization, which shows up in crates.

  • ring has a good reason to include precompiled/pregenerated files, but they're generated by a Perl script.

  • grafbase is the Cargo.lock file size champion: 125KB!

  • libgit2 seems to segfault (null pointer dereference) on redirects with long URLs.
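To make the sha1 and tag problems above more concrete, here is a minimal sketch of the first two lookups, not my actual scanner. It assumes the .crate is a gzipped tarball whose entries sit under a `<name>-<version>/` directory and that `.cargo_vcs_info.json` stores the commit under `git.sha1`. The function names and the list of tag naming schemes are made up for illustration; real repos use more schemes than these.

```python
import io
import json
import tarfile


def read_vcs_sha1(crate_bytes):
    """Return the git sha1 recorded in .cargo_vcs_info.json, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(crate_bytes), mode="r:gz") as tar:
        for member in tar:
            # Strip the leading "<name>-<version>/" directory.
            _, _, inner = member.name.partition("/")
            if inner == ".cargo_vcs_info.json" and member.isfile():
                info = json.load(tar.extractfile(member))
                return info.get("git", {}).get("sha1")
    return None


def candidate_tags(name, version):
    """Guess tag names to try when there is no sha1 (hypothetical scheme list)."""
    versions = [version]
    # Assumption: "+build" metadata is often dropped from the tag name.
    base = version.split("+", 1)[0]
    if base != version:
        versions.append(base)
    tags = []
    for v in versions:
        tags += [f"v{v}", v, f"{name}-v{v}", f"{name}-{v}", f"{name}/v{v}", f"{name}@{v}"]
    return tags
```

Even when one of these tags exists, it may be off by a commit or two, so the tagged commit is only a starting point for comparing file contents.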


I can understand how most of those could arise from plain user error, but...

This is bizarre, how do we think this is happening? The fact that crates.io uses tarballs is an implementation detail not exposed to users. I could maybe see this happening if we had a historic cargo bug that didn't deduplicate the include field of the manifest. Or is it possible that some developers are using tools other than cargo?

I'm assuming rust-lang/cargo issue #13722, "Cargo packages duplicate files on case-insensitive file systems", was opened because of some of these duplicates.

Ah okay, so just duplicates from the perspective of case insensitive systems. I was imagining tarballs somehow ending up with two different files both named Cargo.toml.

There are some tarballs from 2016 that have literally the same filename twice (I assume it was an old implementation in cargo, before it started renaming the old one to Cargo.toml.orig). That's nothing unusual in tar. It will happily append the same file name again if you tell it to, since it's a "tape archive" and doesn't have a central index.
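Both kinds of duplicates are easy to detect by listing the archive entries. A rough sketch, assuming a gzipped .crate held in memory; `find_collisions` is a name invented here, and `str.casefold()` only approximates what a real case-insensitive file system does:

```python
import io
import tarfile
from collections import Counter


def find_collisions(crate_bytes):
    """Return (exact duplicate names, groups of names differing only by case)."""
    with tarfile.open(fileobj=io.BytesIO(crate_bytes), mode="r:gz") as tar:
        # tar has no central index, so a repeated name simply shows up twice.
        names = [member.name for member in tar.getmembers()]

    exact = sorted(n for n, count in Counter(names).items() if count > 1)

    by_folded = {}
    for n in set(names):
        by_folded.setdefault(n.casefold(), set()).add(n)
    case_only = sorted(sorted(group) for group in by_folded.values() if len(group) > 1)

    return exact, case_only
```

When such an archive is extracted, a later entry with the same path normally overwrites the earlier one, so which copy you end up with depends on entry order.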


A similar idea which wouldn't require onerous breaking changes to crates.io would be an additive change to support something like PyPI Trusted Publishers or RubyGems Trusted Publishing.

That's cool, and it'd be nice if crates.io supported it, but it doesn't verify that the published crate contains code from the repository. It verifies that a workflow belongs to a certain GitHub user, but that user can still upload whatever modified code they want.

The OIDC token created by GitHub contains the exact commit from which the workflow ran as well as which commit the workflow run targets. This allows manually inspecting the workflow file to check that it either runs cargo publish without making any changes or, if changes are necessary, that those changes are harmless. https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#understanding-the-oidc-token has a list of all fields stored in the token.
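As a rough illustration of what could be read out of such a token, here is a sketch that decodes the JWT payload. It deliberately does not verify the signature, which a real registry would have to do against GitHub's published keys. The claim names (`repository`, `ref`, `sha`, `workflow_ref`, `workflow_sha`) are as I understand them from the linked page, so check them against that list; the function names are invented for this example.

```python
import base64
import json


def unverified_claims(jwt):
    """Decode a JWT payload WITHOUT checking the signature (inspection only)."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def provenance(claims):
    """Pick out the claims that tie a publish to a repo, ref, and commit."""
    keys = ("repository", "ref", "sha", "workflow_ref", "workflow_sha")
    return {key: claims.get(key) for key in keys}
```

With the commit from `sha` in hand, the workflow file at that commit can be read to see what it does before calling cargo publish.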
