Verifying that .crate files match the git repository

Common problems from my scanning of crates so far:

  • 27000+ crates lack repository property.

  • Second most common problem is lack of sha1 in the .crate file. Include cargo_vcs_info.json even for dirty working directories · Issue #13695 · rust-lang/cargo · GitHub

  • Forks and crates that have moved to another repo tend to leave outdated original repo URL in Cargo.toml.

  • Unfortunately it happens that sha1 of the published package is not in the repo (perhaps published from a release branch that didn't get pushed, or got amended/rebased).

  • Getting the right commit from the git tags doesn't always work. The tags may be off by a commit or two. Monorepos make finding the right tag harder, because sometimes tags are only for a primary crate, not helper crates. Some use custom tag naming schemes, and especially prerelease and + suffixes are hard to match.

  • When there's no sha1 and no tags, it's very tricky to find the right commit, because there are many commits with the same version in Cargo.toml, and it's unclear whether first, last or some in between is the published one. I've added fuzzy matching looking for a most similar commit, but it still fails often.

  • Cargo.lock is most often missing in the repo (fortunately it's not that dangerous).

  • README and LICENSE can be missing or moved due to readme = "../../README" pattern.

  • git submodules, symlinks, and workspaces are annoying. Lots of edge cases (there is a crate that has a git symlink pointing to a submodule, which has a .git link-file in the tarball)

  • Cargo does quite a lot of fixups of Cargo.toml, and they've evolved over time, so sometimes it's hard to verify whether the trimmed-down Cargo.toml is an accurate transformation of the original.

  • There's a bunch of crates with buggy tarballs that have duplicate files, or only work on case-insensitive file systems (e.g. have cargo.lock).

  • git performs automagic CRLF normalization, which shows up in crates.

  • ring has a good reason to include precompiled/pregenerated files, but they're generated by a Perl script.

  • grafbase is the Cargo.lock file size champion: 125KB!

  • libgit2 seems to segfault (null ptr deref) on redirects with long URLs

13 Likes

I can understand how most of those could arise from plain user error, but...

This is bizarre, how do we think this is happening? The fact that crates.io uses tarballs is an implementation detail not exposed to users. I could maybe see this happening if we had a historic cargo bug that didn't deduplicate the include field of the manifest. Or is it possible that some developers are using tools other than cargo?

I'm assuming Cargo packages duplicate files on case-insensitive file systems · Issue #13722 · rust-lang/cargo · GitHub was opened from some of these duplicates

Ah okay, so just duplicates from the perspective of case insensitive systems. I was imagining tarballs somehow ending up with two different files both named Cargo.toml.

There are some tarballs from 2016 that have literally the same filename twice (I assume it was an old implementation in cargo before it started renaming old one to Cargo.toml.orig). That's nothing unusual in tar. It will happily append the same file name again if you tell it to, since it's a "tape archive", and it doesn't have a central index.

3 Likes

A similar idea which wouldn't require onerous breaking changes to crates.io would be an additive change to support something like PyPI Trusted Publishers or RubyGems Trusted Publishing.

That's cool, and it'd be nice if crates.io supported it, but it doesn't verify that the published crate contains code from the repository. It verifies that a workflow belongs to a certain github user, but that user can still upload whatever modified code they want.

The OIDC token created by Github contains the exact commit from which the workflow ran as well as which commit the workflow run targets. This allows manually verifying the workflow file to check that it either runs cargo publish without making any changes or if changes are necessary that these changes are harmless. https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#understanding-the-oidc-token has a list of all fields stored in the token.

2 Likes

@kornel I have just got your new warning about this in my maintainer dashboard on lib.rs and I had indeed forgotten to push the last version when I published it.

Thanks !

1 Like

I just got lib.rs dashboard notifications for some crates I published (dxr, dxr_server, dxr_client). This is the dashboard message for dxr v0.6.3:

Published crate doesn't match its repository

Verified 51 out of 52 files (includes 1 Cargo-generated).

    Crate tarball has been published from a commit that is not in the main branch (searched 11 commits deep).
    Manifest properties don't match: package.build: crates.io=<disabled>, orig=<unset>.
        If you have a monorepo, you can use git tags in format cratename/v0.6.3

Looked for the crate in dxr/. Fetched https://github.com/ironthree/dxr.git tagged 0.6.3 (7903356245d58492f02cf927cdf9569554f591b6).

Checked on 2024-07-13

This check is experimental.

These are somewhat contradictory to me?

The first part is somewhat correct. I did indeed not publish from the main branch, because that already contains changes staged for 0.7.0, and instead created a separate branch for 0.6.x. So the commit is indeed not on the "main" branch, but on the "v0.6.x" branch.

But according to the second part, it looks like the check actually found tag 0.6.3, and did look in the correct workspace subdirectory, but didn't consider the 0.6.3 tag when comparing against version 0.6.3 published on crates.io? I'm slightly confused.

I'll try to make this clearer.

Crate tarball has been published from a commit that is not in the main branch

This is only a warning, it's not treated as fatal in this implementation. However, I'm tracking this and pointing this out, because theoretically it could be used to "cheat" the check by having an innocent main branch, and publishing nefarious code from some obscure branch that nobody is looking at.

The error was in mismatch between Cargo.toml data in the tarball vs repository. It could be a false positive — it's really difficult to compare these, because Cargo Workspaces + implicit properties + editions are tricky to parse, and Cargo rewrites the Cargo.toml before publishing, so its fields change values and change meanings.

Perhaps it is better worded as "reachable from the main branch"? Perhaps it should be "default branch" (given the prevalence of master yet)?

I can understand where you are coming from (in particular, I called out a similar problem to the crates.io team for their check) but this seems like it'd cause too much noise for people who publish from release branches and inadvertently be sending the message they are doing it wrong. At minimum, a check should ensure the commit is reachable from a branch (as I called out in that link). I suspect its easier to have a nefarious tag than a nefarious branch.

1 Like