And indeed, JARs are zip files.
Zip has two distinct ways to get a list of all files — the central directory at the end of the archive and the local file headers inline in the stream — and the two can be out of sync with each other. This led to a security issue on https://addons.mozilla.org, where it was possible to hide files from reviewers while still having them executed by Firefox itself: 1534483 - Ambiguous zip parsing allows hiding add-on files from linter and reviewers
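For a concrete (if deliberately naive) illustration of the ambiguity: a parser that walks local file headers can see a different file list than one that trusts the central directory. This sketch just counts local-header signatures; note that those four bytes can also occur by chance inside compressed data, so this is an illustration, not a validator:

```rust
// Count local file header signatures ("PK\x03\x04") in raw zip bytes.
// A tool trusting these records can disagree with one that reads the
// central directory at the end of the file -- that mismatch is exactly
// the ambiguity exploited in the Mozilla bug.
fn count_local_headers(zip_bytes: &[u8]) -> usize {
    zip_bytes
        .windows(4)
        .filter(|w| *w == b"PK\x03\x04")
        .count()
}
```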
My understanding is that one of the huge differences is that the kernel caches path lookups (the dentry cache) on Linux but not on Windows, and that's not fixed by Dev Drives. (They improve various things, particularly for people without a virus-scanner exclusion, but AFAIK even with a Dev Drive it'll still be slower on Windows.)
Interesting. I do want to get directory handles into std. Then those could eventually be used in rustc. It'd still incur the cost for leaf entries but not for all the ancestors.
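To illustrate what a directory-handle API buys: here's a minimal, POSIX-only sketch on top of openat(2) via the libc crate. The `Dir` type is hypothetical — std has nothing like it today:

```rust
// Hypothetical `Dir` handle sketch (not a std API): open a directory
// once, then resolve children relative to its fd with openat(2), so the
// kernel resolves only the leaf name instead of re-walking every
// ancestor component on each open.
use std::ffi::CString;
use std::fs::File;
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

struct Dir(OwnedFd);

impl Dir {
    fn open(path: &str) -> io::Result<Dir> {
        let path = CString::new(path)?;
        // O_DIRECTORY makes the open fail if `path` isn't a directory.
        let fd = unsafe { libc::open(path.as_ptr(), libc::O_RDONLY | libc::O_DIRECTORY) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(Dir(unsafe { OwnedFd::from_raw_fd(fd) }))
    }

    fn open_file(&self, name: &str) -> io::Result<File> {
        let name = CString::new(name)?;
        // Only `name` (the leaf) is resolved here; the ancestors were
        // paid for once, when the Dir itself was opened.
        let fd = unsafe { libc::openat(self.0.as_raw_fd(), name.as_ptr(), libc::O_RDONLY) };
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(unsafe { File::from_raw_fd(fd) })
    }
}
```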
Infuriatingly, pkzip is both a de facto industry standard used by a million tools and also total crap. It's not just the redundant file lists: there's also the ambiguous central directory / unescaped archive comment, directory entries being flagged solely by being empty files whose names end in /, inconsistent support and usage of \ for /, mtimes that by default only have MS-DOS 2-second precision, multiple platform-specific timestamp extensions that can (and often do) all be present at once, and so on and so forth.
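For instance, the baseline zip timestamp is the old MS-DOS format, which literally cannot represent odd seconds; a quick decode shows why (the function name is mine):

```rust
// Decode the MS-DOS time field used as zip's baseline mtime: 5 bits of
// hours, 6 bits of minutes, and 5 bits of seconds-divided-by-two --
// hence the 2-second precision.
fn dos_time_to_hms(dos_time: u16) -> (u16, u16, u16) {
    let hours = dos_time >> 11;
    let minutes = (dos_time >> 5) & 0x3F;
    let seconds = (dos_time & 0x1F) * 2; // stored in 2-second units
    (hours, minutes, seconds)
}
```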
It's simply so easy to design, document and implement a minimal for-purpose archive format, that every time I see pkzip used I just feel sad.
Of course, a good, permissively licensed standard format would be better, but the pickings seem rather thin; most are focused on improving compression, which makes them a bit too complicated for a straight `.zip` replacement. I'd love to hear suggestions!
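To make "minimal for-purpose" concrete, here's a toy writer for a made-up format with exactly one, unambiguous file list. Everything about it is invented for illustration:

```rust
// Toy archive format (invented for this sketch): a sequence of records,
// each a little-endian u32 path length, the UTF-8 path bytes, a
// little-endian u64 data length, then the data. One list, one parse,
// no trailing directory to disagree with.
use std::io::{self, Write};

fn write_entry<W: Write>(out: &mut W, path: &str, data: &[u8]) -> io::Result<()> {
    out.write_all(&(path.len() as u32).to_le_bytes())?;
    out.write_all(path.as_bytes())?;
    out.write_all(&(data.len() as u64).to_le_bytes())?;
    out.write_all(data)
}
```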
One of my base assumptions here is that the existing `.tar.gz` format is not compatible with random access. Apparently progress has been made on this front: [GitHub - mattgodbolt/zindex: Create an index on a compressed text file](https://github.com/mattgodbolt/zindex) is a tool for constructing an index and using that index to do random-access queries on `.gz` files. It is intended for log files, but the concept should be extensible to identifying files in `.tar`. Does someone want to port it from a C CLI to a Rust library?
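As a sketch of the underlying idea (in the spirit of zlib's examples/zran.c, which this class of tool builds on), the index stores periodic checkpoints of the compressed stream; all names here are hypothetical:

```rust
// Hypothetical checkpoint for random access into a gzip/DEFLATE stream:
// decompression can resume from any checkpoint by seeking to
// `compressed_offset` and priming the inflater with the saved window.
struct Checkpoint {
    /// Byte offset in the compressed stream (real implementations also
    /// track the bit offset within that byte).
    compressed_offset: u64,
    /// Corresponding offset in the uncompressed output.
    uncompressed_offset: u64,
    /// The 32 KiB sliding-window dictionary needed to resume inflation.
    window: Box<[u8; 32 * 1024]>,
}

/// To read uncompressed bytes at `target`: pick the last checkpoint at
/// or before it, resume inflation there, and discard output until
/// `target` is reached. Assumes `index` is sorted by offset.
fn nearest_checkpoint(index: &[Checkpoint], target: u64) -> Option<&Checkpoint> {
    index.iter().rev().find(|c| c.uncompressed_offset <= target)
}
```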
I don't really see any reason to bend over backwards to support `.tar.gz` when `.zip` isn't much worse and there are much better alternatives.
Is random access of the compressed form really necessary? Typical crates are small enough that it's trivial to just decompress them into memory and access them from there.
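A minimal sketch of that approach, assuming the flate2 and tar crates (the function itself is hypothetical):

```rust
// Decompress an entire .crate (.tar.gz) into an in-memory map of
// path -> contents, instead of extracting it to disk.
use std::collections::HashMap;
use std::io::Read;

fn load_crate_sources(gz_bytes: &[u8]) -> std::io::Result<HashMap<String, Vec<u8>>> {
    // Layer the tar reader over a streaming gzip decoder.
    let tar_stream = flate2::read::GzDecoder::new(gz_bytes);
    let mut archive = tar::Archive::new(tar_stream);
    let mut files = HashMap::new();
    for entry in archive.entries()? {
        let mut entry = entry?;
        let path = entry.path()?.to_string_lossy().into_owned();
        let mut contents = Vec::with_capacity(entry.size() as usize);
        entry.read_to_end(&mut contents)?;
        files.insert(path, contents);
    }
    Ok(files)
}
```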
The advantage of being able to support `.tar.gz` is that we could get this optimization on crates that have already been published in the existing format. It also means we can work in parallel on the two different efforts: the "don't put it on disk" optimizations and the "improve the package exchange format" work.
Decompressing files is a noticeable part of clean builds; that's why cargo caches the outputs. Of course, I don't know how much of that is decompression versus disk I/O.
And also OS/filesystem overhead. That's what's most brutal on Windows, AFAIK.
Certainly on spinning rust decompression is effectively free, since those drives are so slow. I'm unsure whether NVMe has gotten to the point that uncompressed is actually faster, especially given just how compressible crates tend to be, containing mostly text files.
Some timings, because I just set up my machine to have the `registry/src/` folder regularly cleaned and was interested in what it actually means; this is for docs.rs, which has 516 source folders extracted:

- clean `cargo check` with pre-extracted sources on zfs on nvme: 35s
- clean `cargo check` with pre-extracted sources on tmpfs: 34s
- noop `cargo check` with pre-extracted sources on both: 0.3s
- noop `cargo check` with extraction to zfs on nvme: 3s
- noop `cargo check` with extraction to tmpfs: 1.8s
The extra time taken to compile from sources on (a very fast) disk is very similar to how long extracting the archives to RAM takes, so I wouldn't be surprised if the simple "decompress the whole archive into memory" approach is almost always a net win.
Interesting. But... how many runs did you average over, and what are the min/max/stdev?
5 runs each; the difference between runs was only ~1%. I didn't bother trying to record the data thoroughly, since this isn't really set up to be that accurate or necessarily representative of typical setups — just some rough numbers to get a feeling for what it costs.