I'm wondering whether it would be possible to make crates.io verify that published code matches what is in a public git repository? It has weirded me out that I can publish before pushing my latest changes, and the recent thread on securing the publish tokens reminded me of this. It would seem pretty simple to (optionally?) have crates.io check that published code matches the current version of a branch in the specified git repository. This would require an attacker to put code into a git repository, where it's likely far easier to observe. It could also provide additional security, e.g. if I configure GitHub with two-factor authentication, that would also protect crates.io.
This is an interesting idea, but it seems better suited to a linting tool (possibly cargo supply-chain) which compares the source code contained in crates against the release tagged in the repository specified in Cargo.toml.
I am interested in implementing this. I have some pieces of it already implemented in the codebase of https://lib.rs.
I'm curious what the discrepancies will look like. There may be crates that use dynamically generated code. I have some crates where I comment out git deps before publishing, because crates.io doesn't allow them, but crates installed from GitHub can use them.
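For context, here is the kind of manifest that has to be tweaked before publishing (the crate names and URL are made up for illustration). crates.io rejects published crates whose dependencies come from git sources, so the git form only works when the crate itself is installed from the repository:

```toml
[dependencies]
# Works when this crate is built straight from its repository, but
# crates.io will refuse a publish containing a git dependency:
# some-dep = { git = "https://github.com/example/some-dep" }

# crates.io-compatible form, commented in before publishing:
some-dep = "0.3"
```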
I would love to see a mechanism that confirms a correspondence between a published version of a crate and a signed git tag.
I've a couple crates that do this, at least one of which is published to crates-io.
I'm a big proponent that for deterministic generated code (i.e. code that every compilation will generate identically), it's "nicer" to package the code in post-generation form, so users installing from crates.io don't have to run the (potentially complicated and dependency-heavy) generation code. After all, you publish once (per version) and many people download and use the crate. Also, if it's things like Unicode tables, generation might need internet access, which should only be required of packagers, not of every user of your library.
But the generated source still isn't source, so it makes sense that you wouldn't track it in VCS; the changes in the output are mostly immaterial to the content of the library itself. I've on-and-off played with ways of making it work seamlessly, so crates.io users can avoid the codegen stage, but cargo build and git deps still work transparently.
That said, I'd still be happy to push a special git tag to represent a chain of trust that the source published is derived from packaging from my repository, without actually running the packaging code. Plus, it'd let me sidestep the bit of "yeah cargo I know you think I'm publishing dirty but also I think I know what I'm doing."
(The big one I'm trying to get to work truly seamlessly is writing derive crates that are always powered through watt, rather than being an alternative opt-in.)
(Also, it'd be really cool if ultimately some custom derives could be transparently pre-expanded during publishing, again so that library users don't have to compile and run the deriving code. But that's a further off dream.)
I should perhaps clarify that I'm mostly thinking of this as an opt-in protection. So you'd specify something in your Cargo.toml that indicates that the published version must correspond to the repository, and that flag from the previous published version would determine whether crates.io enforces correspondence with git on the next publish. An author would thus be able to protect their crate against an attack via access to the crates.io token alone.
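As a sketch of what the opt-in might look like (the `verify-repository` key here is hypothetical, not an existing Cargo.toml field):

```toml
[package]
name = "my-crate"
version = "1.2.0"
repository = "https://github.com/me/my-crate"
# Hypothetical opt-in flag: ask crates.io to reject a publish whose
# contents don't match the repository above. Once set on a published
# version, crates.io would enforce it for subsequent publishes.
verify-repository = true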
Currently, crates.io does not require crates to specify a repository at all. docs.rs already has source browsing of what was published (example: rand 0.8.3's source code; go to docs.rs/anycrate, then open the menu labeled with the crate name and version in the upper left; it's under links -> source), so I'm not sure that adding repository verification to crates.io would provide much additional value.
If you're more interested in viewing diffs of code between crate versions, check out the cargo-review-deps tool from Ferrous Systems.
I think better protection for this would be emailing every owner when a new version is published, which is much nearer to completion than repository verification would be.
It's much easier to browse code and code history on GitHub, and it's much easier to obtain that code and work with it locally. So when I have to "dig into" another crate, I always assume that the crates.io source matches the code in the VCS, and having the crates.io code on docs.rs does not help with that.
There are people who very much do not want to host their code on GitHub.
And those people could use another host. The point is that having access to version control history is valuable in reviewing code.
cargo-crev. You can do:
cargo crev open -u crate-name
to get the actual published source locally and review it. And as a bonus, if you do
cargo crev review -u crate-name
you can share your findings with others, so that not everyone has to review every crate themselves.
Fair, just do s/GitHub/<VCS provider>/ then and what I said remains true. I've done the same with crates hosted on GitLab, for example.
I didn't know about that command; thanks for the pointer!
However, having a git repository with all its history is often still more informative; e.g. git blame can be a great way to get a rough idea of what is going on even in sparsely documented code.
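For instance, blame on even a single line points you at the commit (and hence the commit message and related changes) that introduced it. A toy demonstration, with made-up paths and identity:

```shell
# Build a throwaway repository to demonstrate git blame.
mkdir -p /tmp/blame-demo
cd /tmp/blame-demo
git init -q
git config user.email "you@example.com"
git config user.name "you"
echo 'fn main() {}' > main.rs
git add main.rs
git commit -qm "initial commit"

# Show which commit and author last touched line 1 of main.rs:
git blame -L 1,1 main.rs
```

In a real crate's repository, following the commit hash that blame prints to its commit message often explains undocumented code better than the code itself.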
cargo-download: I use this a lot to look at crate sources locally (including for things like verifying that they match what is in the repository).
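A minimal sketch of what such a local comparison might look like. The directory names and the ignore list are assumptions: in practice the two directories would come from fetching and extracting the crate (e.g. with cargo-download) and cloning the repository, and published crates contain extra files such as `.cargo_vcs_info.json` and `Cargo.toml.orig` that a fair comparison should skip:

```shell
# Compare an extracted crate against a repository checkout, ignoring
# files that cargo adds at publish time and the .git directory.
compare_publish() {
  diff -r \
    --exclude=.git \
    --exclude=.cargo_vcs_info.json \
    --exclude=Cargo.toml.orig \
    "$1" "$2"
}

# Demo with two toy directories standing in for real checkouts.
mkdir -p /tmp/crate_dir/src /tmp/repo_dir/src
echo 'fn main() {}' > /tmp/crate_dir/src/main.rs
echo 'fn main() {}' > /tmp/repo_dir/src/main.rs
compare_publish /tmp/crate_dir /tmp/repo_dir && echo "sources match"
```

A real check would also have to tolerate legitimate differences, e.g. generated files shipped only in the published crate, which is exactly where the interesting discrepancies show up.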
We do not currently require that publishers to crates.io specify any host.
Supporting all the possible VCS providers for this feature adds a large amount of complexity to the implementation.
Which is why I suggested this as an opt-in feature which developers can enable when they publish their crate.
My 2c: (extremely) complex opt-in features provide little value.
This is why I was suggesting it would be much more valuable to have a 3rd-party linting tool which can work against any crate that declares a well-formed repository field in its Cargo.toml metadata.
I am not sure I understand what exact problem the proposed verification would meaningfully solve.
- Easier code browsing of a selected crate version? Not only does not everyone use GitHub/GitLab, but some may not have a public repository in the first place. Even before that, I believe that crates.io should have built-in code browsing; relying on repositories is simply the wrong tool for this job. (BTW, I wonder why this was not added a long time ago, even in a rudimentary form like on docs.rs.)
- Protection from attackers as proposed in the OP? An attacker can create a commit in an obscure branch, or even a commit not on any branch at all. There are various ways this "protection" can be worked around. If we are talking about access to a repository as some kind of second authentication factor, then we should talk about proper 2FA support for crates.io, not about such ad-hoc solutions.
- Using signed git commits to verify that a crate version was indeed published by a trusted author? Again, I think it's the wrong tool for the job; we should instead add proper crate signing/TUF.
Sure, just check it if there is one given.
If you support git, that should cover basically everything. No need to do anything GitHub/GitLab/...-specific.
I agree it's weak from a security perspective. crates.io checking at publish time wouldn't be hard to work around. Even if the code were required to match the main branch or a tag, a malicious author could push code for a second, publish, and then take the code down before anyone noticed. Continuous monitoring at random times would require the published code to stay in the repository for longer.
I've already found cases where it's impossible to match the code:
- Forks. There are many crates that are forks of other crates, and the fork authors forget to update the repository field.
- Deleted or renamed repos. There are a few crates where the repo is 404 (or private).