I'm working for an open source project which cares a lot about building binaries which the community can reproduce by building themselves. Not only that, but we want historical binaries to be reproducible for a long time (years?). However, I've run into a cargo issue which makes this difficult.
Cargo produces hashes of metadata which it passes to rustc; rustc uses them to compute crate disambiguators, which in turn feed into symbol disambiguator suffixes. I've found in practice that symbol names (especially when every symbol name in a crate changes) affect the final compiled output, even if those symbols are optimized out or eventually stripped from the final binary, though I can't point to a particular reason why.
Cargo includes the source url for the crate in its computed metadata hash. This is problematic for long-term reproducibility, since git source urls can be relocated or become unavailable, and changing the url in Cargo.toml results in a different metadata hash even when the same commit hash is available at a different url. Even the url of the crates.io registry depends on the stability of the British Indian Ocean Territory (the .io TLD).
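To make the footgun concrete, here is a hypothetical sketch (the crate name, urls, and rev are invented for illustration): two Cargo.toml entries pinning the identical commit, differing only in url, which nevertheless hash differently today:

```toml
# Both entries pin the exact same commit; only the url differs.
# Today, Cargo mixes the url into the metadata hash, so these two
# configurations produce different symbol disambiguators.
[dependencies]
foo = { git = "https://github.com/example/foo", rev = "9f2c1d4" }
# After a relocation, the same code fetched from a mirror hashes differently:
# foo = { git = "https://mirror.example.org/foo.git", rev = "9f2c1d4" }
```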
As far as I can tell, symbol mangling v2 does not change this, as it's out of scope of that RFC. I can see two potential solutions to this problem:
1. Evaluate whether mixing a crate's source url into its metadata hash is really necessary. Alternatively, for git dependencies, could we mix in the ref instead, and for registry packages some package checksum?
2. (The Lazy Option) In the future, provide some Cargo mechanism to override git dependencies without changing the url used in the metadata hash. I believe this is already possible for registry packages, but not for git dependencies.
Slightly shameful plug for a vaporware project which is presently failing its build due to a security issue:
But a shameless plug to the project it's based on:
I've thought long and hard about the problem of how to reproduce Rust builds, and there's been amazing work at the compiler level to make this happen. I think it's a great time to talk about actual build system tooling for this problem.
You seem to have identified a problem with git sources in general:
This is problematic for reproducibility in posterity as git source urls can be relocated or become unavailable
My solution to this particular problem in Synchronicity is: don't. The initial scope is crates published to crates.io, which I think has best-in-class immutability properties among packaging systems in general, and my understanding of the "crates.io contract" is they will not unpublish a crate unless there is a compliance reason to do so, so de facto "never". The crates.io index also contains SHA-256 hashes of all of these packages.
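For reference, each line of the crates.io index is a JSON object that includes a `cksum` field: the SHA-256 hex digest of the published `.crate` tarball. A minimal sketch of reading one (the entry below is fabricated for illustration; real entries also carry `deps`, `features`, and `yanked` fields):

```python
import json

# A fabricated line in the crates.io index format (illustration only).
index_line = (
    '{"name":"foo","vers":"1.0.0",'
    '"cksum":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"}'
)

entry = json.loads(index_line)
# The cksum is what a client checks the downloaded .crate file against.
print(entry["name"], entry["vers"], entry["cksum"][:8])
```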
If you're trying to write a reproducibility tool for Rust crates, I think the solution for git sources, at least for the time being, is simply to place them out-of-scope.
Perhaps the hashing of Rust source code should be fully "content addressable", but I also think the "always a crates.io build" recipe is the only pragmatic one for reproducible Rust builds for the time being, and one which is poorly explored in and of itself. Reproducible build tooling feels like it's in the "learn to crawl before you learn to walk" stage; handling git dependencies is best left until we have some rudimentary tooling in place that can reproduce a build for an arbitrary crates.io-published crate, given a version of the Rust compiler and a Rustwide docker image.
Understood, but I think it's worth the effort to think about what should be done now and going forward. Short of full "content addressability", I'm suggesting using the git commit hash instead of the repo url as a small incremental improvement to eliminate a big reproducibility footgun today. However, I'm not sure why the current method was chosen, and can't find any explanation in the source (as it seems like it was actually decided on pre-1.0 here and here).
I don't think it makes sense to incorporate any hash sourced from Git into reproducible builds until Git fully completes its migration from SHA-1 to SHA-256, a project which has been dragging on since the SHAttered attack was published in 2017.
Fair point, but note that SHA-1, and indeed even MD5, are only broken with respect to collision resistance -- i.e. a committer may be able to craft two commits with the same hash, given that they can place an arbitrary byte blob somewhere in both of them without anyone noticing. Both have retained their (second) preimage resistance -- i.e. finding a colliding commit for one you did not author remains infeasible. This weakens the attack a bit: assuming the malicious committer does not have write access to the repo (which would mean including the source url in the metadata hash serves no defense anyway), they would need write access to a mirror of the repo and would have to convince dependent crates to switch to it.
Not to mention, I think this should be considered an upstream git issue, which will presumably be fixed before or when a real attack is demonstrated (as Linus has pointed out, there are practical obstacles to crafting an actual attack). That shouldn't prevent Rust from relying on the commit hash as a form of (secure?) content addressing, as it was intended to be.
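For background on what "content addressing" means here: a git object hash is just SHA-1 over a typed, length-prefixed header followed by the object bytes. A sketch reproducing `git hash-object` for a blob:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git hashes the header "blob <len>\0" followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo "hello" | git hash-object --stdin`
print(git_blob_hash(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Commit and tree objects are hashed the same way (with `commit`/`tree` headers), which is why a commit hash transitively pins the entire source tree regardless of which url it was fetched from.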
Can you cite a paper which corroborates this? My understanding is that chosen-prefix collisions of this nature could plausibly undermine the protections of "hardened SHA-1".
In any case, I don't think it makes any sense to be incorporating SHA-1 into systems being designed in 2020.
The very paper for this month's attack: https://eprint.iacr.org/2020/014.pdf page 25: "As a stopgap measure, the collision-detection library of Stevens and Shumow [SS17] can be used to detect attack attempts (it successfully detects our attack)."
And page 28 of the same document: "The GIT developers have been working on replacing SHA-1 for a while, and they use a collision detection library [SS17] to mitigate the risks of collision attacks".
If possible, I'd like to steer the topic back toward this perspective, lest the thread become no longer relevant to Cargo and thus off-topic for the IRLO forums. In other words: evaluating whether [insert content addressing mechanism here] is even a viable replacement for the repo url in the crate metadata hash calculation.
Note that GitHub stores all forks in the same internal object repo as the original, so you only need to push to your own fork to make your object accessible everywhere. That's GitHub's problem more than git itself, but still.
Anybody who really cares about being able to rebuild ancient versions will be vendoring everything. On a project I once worked on that did this (not in Rust), the build server had no network access except to the version control system(s), and all sources -- including third-party code and even the toolchain installers -- were checked into version control. A company with an obligation to rebuild ancient versions simply can't rely on any other party whatsoever. That said
Well, that's the obvious solution. A hash in the index must be verifiable by the client after fetching the package, so it is obviously possible to calculate and verify that hash for a package pulled from anywhere else too.
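That verification step is a one-liner; a hedged sketch, where the bytes below stand in for a `.crate` file fetched from an arbitrary mirror and the checksum stands in for the one recorded in the registry index:

```python
import hashlib

def verify_crate(data: bytes, expected_cksum: str) -> bool:
    # Compare the SHA-256 of the fetched tarball against the
    # checksum recorded in the registry index.
    return hashlib.sha256(data).hexdigest() == expected_cksum

# Stand-in bytes for a downloaded .crate file (illustration only).
data = b"fake crate contents"
cksum = hashlib.sha256(data).hexdigest()
assert verify_crate(data, cksum)
assert not verify_crate(data + b"tampered", cksum)
```

The url the bytes came from never enters the check, which is exactly why the checksum, not the source url, is the natural thing to pin.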
my understanding of the "crates.io contract" is they will not unpublish a crate unless there is a compliance reason to do so, so de facto "never".
crates.io has deleted crates in the past and makes no guarantees that it will not do so in the future. docs.rs has already run into trouble with this; see for example Discord (sorry for the account wall -- the gist is that an author published thousands of crates in a day and had them all removed).