The State of Rust Tarballs

The delivery mechanism for the tarball/filetree is a separate layer of concern. It’s the layout of the tarball that I’m concerned with; how it’s delivered can be up to user preference (HTTP, rustup+HTTP, a personal S3 bucket, torrents, IPFS).


I wrote the original index generator. Its primary purpose at the time was to reduce the number of S3 API calls, to keep the cost of index generation down. It appears those cost savings ended up not being justified, especially given that generating the index at all requires listing every single file in the bucket, and those files are very numerous, not to mention that the pace at which they are added only increases!

In the end, the problematic (and inherently serial) part of index generation is making one API call per 1,000 files in the bucket. That adds up to a lot of time spent doing mostly nothing, and the problem will only get worse over time.
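To make the cost concrete, here is a small sketch (plain Python, no real S3 calls) of why the listing is slow: the S3 list API returns at most 1,000 keys per response, and each request must carry the continuation token from the previous one, so the calls cannot be parallelized.

```python
MAX_KEYS_PER_CALL = 1000  # S3 ListObjects returns at most 1,000 keys per call

def list_calls_needed(total_objects: int) -> int:
    """Number of sequential list calls needed to enumerate the whole bucket."""
    return -(-total_objects // MAX_KEYS_PER_CALL)  # ceiling division

# With roughly two million artifacts under dist/, a full listing takes
# about 2,000 round trips, one after another.
print(list_calls_needed(2_000_000))  # 2000
```

Even at a fast 100 ms per round trip, 2,000 serial calls is over three minutes of doing nothing but paging through the listing.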

Now, generating a listing for the dated directories is not much of a problem, because the S3 API supports filtering a listing by prefix (e.g. filtering by dist/2019-01-15 returns only the files in that directory). This is why listing the date directories works at all. Alas, the same is not possible for the dist/ directory itself, because it has no unique prefix that could be used to select only the files there!
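A toy in-memory model (hypothetical key names, no real S3 involved) shows the asymmetry: a dated prefix selects exactly that day’s files, but the dist/ prefix matches every key in the bucket, dated subdirectories included.

```python
# Simulated bucket contents; key names are illustrative only.
keys = [
    "dist/rustc-nightly-x86_64-unknown-linux-gnu.tar.gz",
    "dist/2019-01-15/rustc-nightly-x86_64-unknown-linux-gnu.tar.gz",
    "dist/2019-01-16/rustc-nightly-x86_64-unknown-linux-gnu.tar.gz",
]

def list_with_prefix(prefix: str) -> list:
    """Mimic S3 prefix filtering: return every key starting with the prefix."""
    return [k for k in keys if k.startswith(prefix)]

print(len(list_with_prefix("dist/2019-01-15/")))  # 1: just that day's files
print(len(list_with_prefix("dist/")))             # 3: the entire bucket
```

Since every dated key also begins with dist/, there is no prefix that isolates only the “current” top-level files.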

I believe that changing the directory structure somewhat, so that the “current” dist/ no longer contains the dated subdirectories, would resolve essentially all of the blockers here, but it would also break all the tooling that assumes the current structure…

I’m a commenter on https://github.com/rust-lang/rust/issues/56971

I unfortunately don’t have any suggestion on how to resolve this issue with S3, but I can explain my own stake in this tarball problem and petition for minimal change.

The Bazel rules_rust rules set, which implements partial support for the Rust language with the Bazel build tool, currently depends both on the generated index (for maintainers to discover files and understand the organization) and on the current directory structure itself. It would be unpleasant for us if the index stopped existing indefinitely, if the organization of directories within the tarballs changed, or if the dist directory structure changed.

I expect any tools that manage their own toolchains to have similar concerns about changes in any of these formats.

I think any method that keeps S3 as the source of truth for the directory listing will lay the same trap, just further down the road. S3 is great for setting and getting individual files, but it is not good to use as an index or to treat as a filesystem.

The source of truth for “which files exist, what their structure is, and where they are” should live elsewhere, since these are questions S3 is bad at answering. You could have this TOC live in S3 at a known location/prefix, though then you also subject yourself to S3’s eventual consistency on file updates, which makes it difficult to know when you can regenerate the static index.
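One possible shape for such a TOC (a sketch only; the key name dist/index.json and the field names are invented for illustration) is a JSON manifest that the uploader appends to as it publishes files, so that no bucket listing is ever needed:

```python
import json

# Hypothetical manifest kept at a well-known key (e.g. "dist/index.json");
# the publisher adds an entry for each file as it uploads it.
toc = {
    "generated": "2019-01-16T00:00:00Z",
    "files": [
        {
            "path": "dist/2019-01-16/rustc-nightly-x86_64-unknown-linux-gnu.tar.gz",
            "size": 123456789,
        },
    ],
}

manifest = json.dumps(toc, indent=2)
# Round-trip to show consumers can answer "which files exist" from the
# manifest alone, without listing the bucket.
parsed = json.loads(manifest)
print(len(parsed["files"]))  # 1
```

The trade-off mentioned above still applies: if the manifest itself lives in S3, readers may briefly see a stale copy after an update.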


I’m also curious what all the artifacts in S3 are, as it doesn’t seem like Rust has been around long enough to overwhelm a bucket with numerous files. Are more than just {n tarballs} * {m targets} getting stored in the dist/ prefix?

It’s on the order of {n targets} * {m components} * {l files} * {k days} where n~70, m~3 (std, rls, etc.), l~6 (tar, hash, signature, etc.), and k~365*5. Fudge a little bit for targets being added over time (n), components per target changing over time (m), and not counting stable/beta releases. That’s about 2 million.
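A quick back-of-the-envelope check of that estimate, using the figures above:

```python
# Approximate figures quoted in the post above.
n_targets = 70       # supported targets
m_components = 3     # std, rls, etc.
l_files = 6          # tar, hash, signature, etc.
k_days = 365 * 5     # roughly five years of nightlies

total = n_targets * m_components * l_files * k_days
print(total)  # 2299500: on the order of two million files
```

So before any fudging for added targets or stable/beta releases, the product already lands at roughly 2.3 million objects.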

This was discussed in the infra team meeting yesterday and @pietroalbini suggested automatically keeping the index up to date using SQS.
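The event-driven idea could look roughly like this (a sketch under assumptions: S3 is configured to publish ObjectCreated notifications to an SQS queue, and the message body follows AWS’s documented S3 event format with a Records array; the consumer code and key names here are hypothetical). Each new key is applied to the index incrementally, so the bucket never needs to be re-listed:

```python
import json

index = set()  # in-memory stand-in for the persisted index

def handle_event(body: str) -> None:
    """Apply one S3 event notification (JSON string) to the index."""
    event = json.loads(body)
    for record in event.get("Records", []):
        # Only object-creation events add entries to the index.
        if record["eventName"].startswith("ObjectCreated"):
            index.add(record["s3"]["object"]["key"])

# A minimal message in the shape S3 sends for a new upload.
sample = json.dumps({"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"object": {"key": "dist/2019-01-16/rustc-nightly.tar.gz"}},
}]})
handle_event(sample)
print(sorted(index))  # the new key is now indexed
```

The appeal is that the work becomes proportional to the number of new files, not the total number of files in the bucket.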

What’s “the index” here? Static HTML?

Is there a benefit to deriving the index information from S3 (regardless of how it’s done), as opposed to having the build system that produces the artifacts and posts them to S3 update the index at the same time?

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.