And indeed, JARs are zip files.
Zip has two distinct ways to get a list of all files — the central directory at the end of the archive and the local file headers inline in the stream — and the two can be out of sync with each other. This led to a security issue on https://addons.mozilla.org, where it was possible to hide files from reviewers while still having them executed by Firefox itself: 1534483 - Ambiguous zip parsing allows hiding add-on files from linter and reviewers
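For a concrete (if deliberately naive) illustration of the ambiguity: a parser that walks local file headers can see a different file list than one that trusts the central directory. This sketch just counts local-header signatures; note that those four bytes can also occur by chance inside compressed data, so this is an illustration, not a validator:

```rust
// Count local file header signatures ("PK\x03\x04") in raw zip bytes.
// A tool trusting these records can disagree with one that reads the
// central directory at the end of the file -- that mismatch is exactly
// the ambiguity exploited in the Mozilla bug.
fn count_local_headers(zip_bytes: &[u8]) -> usize {
    zip_bytes
        .windows(4)
        .filter(|w| *w == b"PK\x03\x04")
        .count()
}
```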
My understanding is that one of the huge differences is that the kernel caches path lookups (the dentry cache) on Linux but not on Windows, and that's not fixed by Dev Drives. (They improve various things, particularly for people without a virus-scanner exclusion, but AFAIK even with a Dev Drive it'll still be slower on Windows.)
Interesting. I do want to get directory handles into std. Then those could eventually be used in rustc. It'd still incur the cost for leaf entries but not for all the ancestors.
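To illustrate what a directory-handle API buys: here's a minimal, POSIX-only sketch on top of openat(2) via the libc crate. The `Dir` type is hypothetical — std has nothing like it today:

```rust
// Hypothetical `Dir` handle sketch (not a std API): open a directory
// once, then resolve children relative to its fd with openat(2), so the
// kernel resolves only the leaf name instead of re-walking every
// ancestor component on each open.
use std::ffi::CString;
use std::fs::File;
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

struct Dir(OwnedFd);

impl Dir {
    fn open(path: &str) -> io::Result<Dir> {
        let path = CString::new(path)?;
        // O_DIRECTORY makes the open fail if `path` isn't a directory.
        let fd = unsafe { libc::open(path.as_ptr(), libc::O_RDONLY | libc::O_DIRECTORY) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(Dir(unsafe { OwnedFd::from_raw_fd(fd) }))
    }

    fn open_file(&self, name: &str) -> io::Result<File> {
        let name = CString::new(name)?;
        // Only `name` (the leaf) is resolved here; the ancestors were
        // paid for once, when the Dir itself was opened.
        let fd = unsafe { libc::openat(self.0.as_raw_fd(), name.as_ptr(), libc::O_RDONLY) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(unsafe { File::from_raw_fd(fd) })
    }
}
```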
Infuriatingly, pkzip is both a de facto industry standard used by a million tools and also total crap. It's not just the redundant file lists: there's also the ambiguous central directory / unescaped archive comment, directory entries being flagged solely by being empty files whose names end in /, inconsistent support and usage of \ for /, mtimes that by default only have MS-DOS 2-second precision, multiple platform-specific timestamp extensions that can (and often do) all be present at once, and so on and so forth.
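For instance, the baseline zip timestamp is the old MS-DOS format, which literally cannot represent odd seconds; a quick decode shows why (the function name is mine):

```rust
// Decode the MS-DOS time field used as zip's baseline mtime: 5 bits of
// hours, 6 bits of minutes, and 5 bits of seconds-divided-by-two --
// hence the 2-second precision.
fn dos_time_to_hms(dos_time: u16) -> (u16, u16, u16) {
    let hours = dos_time >> 11;
    let minutes = (dos_time >> 5) & 0x3F;
    let seconds = (dos_time & 0x1F) * 2; // stored in 2-second units
    (hours, minutes, seconds)
}
```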
It's simply so easy to design, document and implement a minimal for-purpose archive format, that every time I see pkzip used I just feel sad.
Of course, a good, permissively licensed standard format would be better, but the pickings seem rather thin; most are focused on improving compression, which makes them a bit too complicated for a straight `.zip` replacement. I'd love to hear suggestions!
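To make "minimal for-purpose" concrete, here's a toy writer for a made-up format with exactly one, unambiguous file list. Everything about it is invented for illustration:

```rust
// Toy archive format (invented for this sketch): a sequence of records,
// each a little-endian u32 path length, the UTF-8 path bytes, a
// little-endian u64 data length, then the data. One list, one parse,
// no trailing directory to disagree with.
use std::io::{self, Write};

fn write_entry<W: Write>(out: &mut W, path: &str, data: &[u8]) -> io::Result<()> {
    out.write_all(&(path.len() as u32).to_le_bytes())?;
    out.write_all(path.as_bytes())?;
    out.write_all(&(data.len() as u64).to_le_bytes())?;
    out.write_all(data)
}
```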
One of my base assumptions here is that the existing `.tar.gz` format is not compatible with random access. Apparently progress has been made on this front: [GitHub - mattgodbolt/zindex: Create an index on a compressed text file](https://github.com/mattgodbolt/zindex) is a tool for constructing an index and using that index to do random-access queries on `.gz` files. It is intended for log files, but the concept should be extensible to identifying files in `.tar`. Does someone want to port it from a C CLI to a Rust library?
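As a sketch of the underlying idea (in the spirit of zlib's examples/zran.c, which this class of tool builds on), the index stores periodic checkpoints of the compressed stream; all names here are hypothetical:

```rust
// Hypothetical checkpoint for random access into a gzip/DEFLATE stream:
// decompression can resume from any checkpoint by seeking to
// `compressed_offset` and priming the inflater with the saved window.
struct Checkpoint {
    /// Byte offset in the compressed stream (real implementations also
    /// track the bit offset within that byte).
    compressed_offset: u64,
    /// Corresponding offset in the uncompressed output.
    uncompressed_offset: u64,
    /// The 32 KiB sliding-window dictionary needed to resume inflation.
    window: Box<[u8; 32 * 1024]>,
}

/// To read uncompressed bytes at `target`: pick the last checkpoint at
/// or before it, resume inflation there, and discard output until
/// `target` is reached. Assumes `index` is sorted by offset.
fn nearest_checkpoint(index: &[Checkpoint], target: u64) -> Option<&Checkpoint> {
    index.iter().rev().find(|c| c.uncompressed_offset <= target)
}
```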
I don't really see any reason to bend over backwards to support `.tar.gz` when `.zip` isn't much worse and there are much better alternatives.
Is random access of the compressed form really necessary? Typical crates are small enough that it's trivial to just decompress them into memory and access them from there.
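A minimal sketch of that approach, assuming the flate2 and tar crates (the function itself is hypothetical):

```rust
// Decompress an entire .crate (.tar.gz) into an in-memory map of
// path -> contents, instead of extracting it to disk.
use std::collections::HashMap;
use std::io::Read;

fn load_crate_sources(gz_bytes: &[u8]) -> std::io::Result<HashMap<String, Vec<u8>>> {
    // Layer the tar reader over a streaming gzip decoder.
    let tar_stream = flate2::read::GzDecoder::new(gz_bytes);
    let mut archive = tar::Archive::new(tar_stream);
    let mut files = HashMap::new();
    for entry in archive.entries()? {
        let mut entry = entry?;
        let path = entry.path()?.to_string_lossy().into_owned();
        let mut contents = Vec::with_capacity(entry.size() as usize);
        entry.read_to_end(&mut contents)?;
        files.insert(path, contents);
    }
    Ok(files)
}
```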
The advantage of being able to support `.tar.gz` is that we could get this optimization on crates that have already been published in the existing format. It also means we can work in parallel on the two different efforts: the "don't put it on disk" optimizations and the "improve the package exchange format" work.
Decompressing files is a noticeable part of clean builds; that's why cargo caches the outputs. Of course, I don't know how much of that is decompression versus disk I/O.
And also OS/filesystem overhead. That's what's most brutal on Windows, AFAIK.
Certainly on spinning rust decompression is effectively free, since those drives are so slow. I'm unsure whether NVMe has gotten to the point that uncompressed is actually faster, especially given just how compressible crates tend to be, containing mostly text files.
Some timings, because I just set up my machine to have the `registry/src/` folder regularly cleaned and was interested in what it actually means; this is for docs.rs, which has 516 source folders extracted:

- clean `cargo check` with pre-extracted sources on zfs on nvme: 35s
- clean `cargo check` with pre-extracted sources on tmpfs: 34s
- noop `cargo check` with pre-extracted sources on both: 0.3s
- noop `cargo check` with extraction to zfs on nvme: 3s
- noop `cargo check` with extraction to tmpfs: 1.8s
The extra time taken to compile from sources on (a very fast) disk is very similar to how long extracting the archives to RAM takes, so I wouldn't be surprised if the simple "decompress the whole archive into memory" approach is almost always a net win.
Interesting. But... how many runs did you average over, and what are the min/max/stdev?
5 runs each; the difference between runs was only ~1%. I didn't bother trying to record the data thoroughly, since this isn't really set up to be that accurate or necessarily representative of typical setups — just some rough numbers to get a feeling for what it costs.