Cargo's crate index: upcoming squash into one commit


To make it clear: I agree that there is a technical reason for squashing the index into one commit, but I think there is none to discard the legacy history. So while squashing is okay, please don’t discard the legacy history, not “in the immediate future” or afterwards. If somehow the fat branch leaks onto the skinny branch (idk heard that the git protocol sends some stuff that the cloner might not be interested in at all), then put it into a different repo. If cargo stops using git, please press the “archive this repository” button, not the “delete this repository” one. Thanks.


Note: using git gc --aggressive seems to reduce the size of the repository from about 108MB to 49MB, which isn’t as good as doing a squash but doesn’t result in losing all of its history either. Was that considered?

GitHub, to my knowledge, doesn’t have a way of running this, but deleting and recreating the repository would work.
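
For anyone who wants to reproduce this kind of before/after comparison locally, here is a minimal sketch. It runs against a throwaway repository with made-up contents (the file name and commit messages are placeholders, not the real index), so the numbers it prints say nothing about the 108MB/49MB figures above.

```shell
set -eu

# Throwaway repository with a few commits (hypothetical data, not the real index).
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q "$repo"
for i in 1 2 3 4 5; do
  echo "crate-$i" >> "$repo/index.txt"
  git -C "$repo" add index.txt
  git -C "$repo" commit -q -m "update $i"
done

# Size report before repacking: loose objects are listed under "count".
git -C "$repo" count-objects -v

# Repack everything aggressively and drop unreachable objects right away.
git -C "$repo" gc --quiet --aggressive --prune=now

# Size report after: the objects now live in a pack ("in-pack", "size-pack").
git -C "$repo" count-objects -v
loose_after=$(git -C "$repo" count-objects -v | sed -n 's/^count: //p')
packed_after=$(git -C "$repo" count-objects -v | sed -n 's/^in-pack: //p')
```

On a real clone, comparing the `size-pack` line before and after is the quickest way to see how much the repack saved.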


@est31 I don’t think anyone was suggesting deleting the history. Thank you for so clearly articulating why that would be a bad idea.


@alexcrichton I might have missed it, but I didn’t see this addressed above: the index gets an update several times an hour. This means that in order to not miss an update while squashing the history you need to freeze it for some time (unless there is some more elaborate scheme). Are you planning to take it down for some time?


The use of git push --force-with-lease=refs/heads/master:$the_earlier_commit means that the force push will only succeed if no new commits (package updates) are present upstream. Presumably if this fails, @alexcrichton will just rerun the script until it succeeds.
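
To illustrate the retry idea, here is a rough sketch in a local sandbox. Everything in it is hypothetical: the repository paths, the `publish` and `squash_once` helpers, and the simulated race are made up for the demo and are not taken from the actual script. It only shows that a push with a stale lease is rejected and that rerunning after a fresh fetch goes through.

```shell
set -eu

work=$(mktemp -d)
up="$work/upstream.git"; pub="$work/publisher"; sq="$work/squasher"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q --bare "$up"
for d in "$pub" "$sq"; do
  git init -q "$d"
  git -C "$d" remote add origin "$up"
done

# publish NAME: add one file on top of upstream master and push it.
publish() {
  git -C "$pub" fetch -q origin
  if git -C "$pub" rev-parse -q --verify origin/master >/dev/null; then
    git -C "$pub" checkout -q --detach origin/master
  fi
  echo "$1" > "$pub/$1"
  git -C "$pub" add "$1"
  git -C "$pub" commit -q -m "publish $1"
  git -C "$pub" push -q origin HEAD:refs/heads/master
}

# squash_once: one squash attempt; the push succeeds only if master
# still points at the commit that was fetched.
race=yes
squash_once() {
  git -C "$sq" fetch -q origin
  earlier=$(git -C "$sq" rev-parse origin/master)
  new_rev=$(git -C "$sq" commit-tree "$earlier^{tree}" -m "squash")
  # Simulate a package update landing between the fetch and the push (first try only).
  if [ "$race" = yes ]; then publish b; race=no; fi
  git -C "$sq" push -q origin "$new_rev:refs/heads/master" \
    --force-with-lease="refs/heads/master:$earlier" 2>/dev/null
}

publish a
failed=0
until squash_once; do failed=$((failed + 1)); done
echo "failed attempts before success: $failed"
```

The second attempt squashes the tree that already contains the update from `publish b`, so nothing published during the race is lost.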


Perhaps changing distribution method of the first checkout doesn’t count as switching to something else?

What if instead of cloning the initial repo state from GitHub, Cargo would obtain a heavily GC’ed, heavily compressed version from somewhere else? (as a tarball.xz for example)?


Cargo could certainly download a git bundle from a CDN. That wouldn’t necessarily speed things up, though; GitHub already has heavy CDN coverage. Perhaps worth a test, but at the same time, perhaps not worth the extra complexity in cargo.
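
For reference, seeding a checkout from a bundle and then catching up over the normal transport could look roughly like the sketch below. The paths and file names are hypothetical, the bundle is read from local disk rather than a CDN, and nothing here reflects how Cargo actually fetches the index.

```shell
set -eu

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-in for the index repository.
git init -q "$work/index"
echo "serde" > "$work/index/crates.txt"
git -C "$work/index" add crates.txt
git -C "$work/index" commit -q -m "initial index state"
head=$(git -C "$work/index" rev-parse HEAD)

# Pack the history of HEAD into a single file that could be served statically.
git -C "$work/index" bundle create "$work/index.bundle" HEAD 2>/dev/null

# First checkout: seed an empty repository from the bundle file.
git init -q "$work/seeded"
git -C "$work/seeded" fetch -q "$work/index.bundle" HEAD:refs/remotes/origin/master
seeded=$(git -C "$work/seeded" rev-parse refs/remotes/origin/master)

# Later updates: fetch incrementally from the live repository as usual.
echo "rand" >> "$work/index/crates.txt"
git -C "$work/index" commit -q -am "publish rand"
git -C "$work/seeded" fetch -q "$work/index" HEAD:refs/remotes/origin/master
updated=$(git -C "$work/seeded" rev-parse refs/remotes/origin/master)
```

Whether this beats a plain fetch would depend on how stale the bundle is allowed to get, which is the part that would need testing.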


This is certainly a possibility! As mentioned a few times previously in this thread, though, rethinking larger portions of how Cargo works isn’t really in scope for this topic, but it’d be good to discuss in a separate thread!


While there is no button for it in the UI, you can ask GitHub support to do a gc for you, and I haven’t heard of a case where they refused to do it. I don’t know the details of what the --aggressive flag entails but it’s possible they can do it as well.


The RFC’s going to have to address this before it can be merged (not the other way around).


Ok and this is now done!

For posterity, this is the script that I used:

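set -ex

# Keep the existing history reachable on a dated snapshot branch.
# `origin` is assumed to point at the index repository.
now=`date '+%Y-%m-%d'`
git fetch origin
git reset --hard origin/master
head=`git rev-parse HEAD`
git push -f origin $head:refs/heads/snapshot-$now

msg=$(cat <<-END
Collapse index into one commit

Previous HEAD was $head, now on the \`snapshot-$now\` branch

More information about this change can be found [online]
END
)

# Create a parentless commit that reuses the tree of the current HEAD.
new_rev=$(git commit-tree HEAD^{tree} -m "$msg")

# Replace master, but only if it still points at $head (no new publishes).
git push \
  origin \
  $new_rev:refs/heads/master \
  --force-with-lease=refs/heads/master:$head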
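
As a sanity check on what `git commit-tree HEAD^{tree}` produces, the following sketch can be run in a scratch directory. The repository contents are invented for the demo; it only confirms that the new commit has the same tree as the old head and no parents.

```shell
set -eu

repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
git init -q "$repo"
for i in 1 2 3; do
  echo "version $i" > "$repo/crate.txt"
  git -C "$repo" add crate.txt
  git -C "$repo" commit -q -m "update $i"
done

head=$(git -C "$repo" rev-parse HEAD)
# New root commit pointing at exactly the same tree as the old head.
new_rev=$(git -C "$repo" commit-tree "$head^{tree}" -m "Collapse index into one commit")

old_tree=$(git -C "$repo" rev-parse "$head^{tree}")
new_tree=$(git -C "$repo" rev-parse "$new_rev^{tree}")
old_count=$(git -C "$repo" rev-list --count "$head")
new_count=$(git -C "$repo" rev-list --count "$new_rev")
echo "commits reachable from old head: $old_count, from squashed commit: $new_count"
```

Because the trees are identical, a checkout of the squashed commit is byte-for-byte the same as a checkout of the old head.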


Were you going to add the readme at the same time, or do it separately? It would be nice to have it clearly documented as an internal detail of cargo and not open to PRs.


Any time should be fine!


If you keep the branch with the snapshot in the same repo, the download requirements won’t change (unless you avoid doing a default clone and request only the master branch).


I believe cargo just fetches the branch, not a full repo clone.
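
The difference between the two is easy to check in a sandbox. The sketch below is an assumption-laden demo: the repository layout, the `snapshot` branch name, and the refspec are made up, and it does not claim to match what cargo runs internally. It compares a default clone with a fetch of only the master branch.

```shell
set -eu

work=$(mktemp -d)
up="$work/upstream"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Upstream with a three-commit legacy history kept on a snapshot branch...
git init -q "$up"
for i in 1 2 3; do
  echo "v$i" > "$up/crate.txt"
  git -C "$up" add crate.txt
  git -C "$up" commit -q -m "update $i"
done
old_head=$(git -C "$up" rev-parse HEAD)
git -C "$up" update-ref refs/heads/snapshot "$old_head"

# ...and a squashed, single-commit master.
new_rev=$(git -C "$up" commit-tree "$old_head^{tree}" -m "squash")
git -C "$up" update-ref refs/heads/master "$new_rev"

# Default clone: every branch comes along, including the snapshot.
git clone -q "$up" "$work/full"
full_commits=$(git -C "$work/full" rev-list --count --all)

# Fetching only master leaves the legacy commits behind.
git init -q "$work/fetcher"
git -C "$work/fetcher" fetch -q "$up" refs/heads/master:refs/remotes/origin/master
fetched_commits=$(git -C "$work/fetcher" rev-list --count --all)
if git -C "$work/fetcher" cat-file -e "$old_head" 2>/dev/null; then
  has_legacy=yes
else
  has_legacy=no
fi
echo "full clone: $full_commits commits, master-only fetch: $fetched_commits"
```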


@alexcrichton, this is not quite true due to it being documented in (albeit that has not landed properly yet).


True! I think though we’ll always reserve the right to change it for at any time (while preserving compatibilities for other registries and such)