"Jar" for Rust: single file crate support for `rustc`

During a post-conf unconf conversation with people on the cargo team, the idea came up of having rustc read .crate files as a single monolith instead of forcing people/cargo to extract them. In general it doesn't buy much, until you remember that file system IO on Windows is dramatically slow when reading a lot of small files, which is exactly the case that crates represent. On the rustc side we already dump every file that we read into a contiguous buffer, so this wouldn't be as disruptive a change as it sounds, IMO.

Thoughts?

It sounds promising as long as:

  • build.rs doesn't use std::fs to inspect the sources; and
  • neither do proc macros (e.g., graphql_client_codegen reading a query/schema file).

These are not all that common, but I feel it means that Cargo.toml needs an opt-in for "can be built without extraction" rather than an opt-out.

Edit to add: cargo publish could verify this flag setting as well.

It also means:

  • the cache only contains hash-verified files (once extracted, knowing "is this what I think it should be" is much harder);
  • error reporting for in-crate code will cause editors with not-sophisticated-enough jump-to-error support to get very lost.

For the error reporting, if an error is found, wouldn't it be possible to unzip the .crate into read-only files that would only be used for the editor to jump to the error?

That means that cargo is interpreting and editing rustc output. I suppose --> <filename>:<line>:<column> can't show up anywhere else given the "quoting" of user code that rustc does (ANSI color may interfere here too?), so that would be one possibility. But it is something that would need to be considered with such an approach.
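To make the "interpreting rustc output" part concrete, here is a rough sketch of pulling the location out of a `--> <filename>:<line>:<column>` line so that a wrapper could point it at an extracted copy. The `Location` type and `parse_location` function are made up for illustration, and the sketch assumes ANSI color codes have already been stripped.

```rust
/// A source location pulled out of a rustc `--> file:line:column` line.
#[derive(Debug, PartialEq)]
pub struct Location {
    pub file: String,
    pub line: u32,
    pub column: u32,
}

/// Returns the location if `text` is a `--> file:line:column` line.
/// Assumes ANSI color codes have already been stripped.
pub fn parse_location(text: &str) -> Option<Location> {
    let rest = text.trim_start().strip_prefix("--> ")?;
    // Split from the right so a Windows drive prefix such as `C:\` stays
    // part of the file name.
    let mut parts = rest.trim_end().rsplitn(3, ':');
    let column: u32 = parts.next()?.parse().ok()?;
    let line: u32 = parts.next()?.parse().ok()?;
    let file = parts.next()?;
    if file.is_empty() {
        return None;
    }
    Some(Location { file: file.to_string(), line, column })
}
```

Scraping the human-readable text like this is fragile. rustc can also emit diagnostics as JSON (`--error-format=json`), which would probably be the safer thing to rewrite.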

The other side to this besides Windows file IO performance is ensuring the package is read-only.

One possible problem is IDEs doing "go to definition".

This could be supported with RFC 3200.

Even vim is capable of showing files within archives; I would certainly hope that more modern tooling is too.

For proc-macros and build scripts, the ultimate solution is sandboxing with packages declaring the capabilities they need for doing things outside of the sandbox. We aren't there yet though. In the per-user cache discussions, I've brought up a manifest attribute that is a "pinky swear" that the build script or proc macro is "pure" and doesn't access any outside resources that would change between root packages or systems.

I wonder how much we'd gain even if we didn't do any "pinky swear" and cargo still extracted .crate files when a build script is involved or the package has a proc macro in its dependency tree (to protect against re-exports).

Glad to hear, though there would likely be a transition cost we should be aware of.

If cargo extracted the .crate when a build script or proc macro is involved, but still used the .crate for invoking rustc itself, that would at least give the benefit of making it impossible to accidentally modify the source consumed by rustc itself and harder to do it deliberately.

It would be a huge step forward if rustc could read a crate file. It would be even better if rustc could ask cargo for a buffer containing the contents of a particular file, and leave the file extraction to cargo, in a sans-IO style. This would allow cargo to use the same optimization for git dependencies and make it easier for cargo to change the archive format of a crate file in the future.

Even more excitingly, for normal path dependencies cargo is forced to use the famously unreliable mtime to guess whether a file still contains the same contents that rustc read. Using a file hash is more difficult than it sounds, mostly because cargo can determine that the hash changed between when it started rustc and when rustc completed, but not which version rustc saw. There are other solutions to this problem, but cargo doing the file I/O is more elegant.
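A minimal sketch of why this is appealing, assuming a hypothetical `SourceProvider` interface (nothing like it exists in rustc today): if the build system hands over the bytes itself, it can fingerprint exactly the buffer the compiler received, so there is no window in which the file on disk and the recorded hash can disagree. `RecordingProvider` and `fingerprint` are illustrative names, and `DefaultHasher` stands in for whatever real content hash would be used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Hypothetical interface: the compiler asks for a file's bytes instead of
/// opening the file itself.
pub trait SourceProvider {
    fn read(&mut self, path: &str) -> Option<&[u8]>;
}

/// Non-cryptographic stand-in for a real content hash.
pub fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Serves files from memory and records the fingerprint of every buffer it
/// actually handed out.
pub struct RecordingProvider {
    files: HashMap<String, Vec<u8>>,
    seen: HashMap<String, u64>,
}

impl RecordingProvider {
    pub fn new(files: HashMap<String, Vec<u8>>) -> Self {
        RecordingProvider { files, seen: HashMap::new() }
    }

    /// Fingerprint of the bytes the compiler saw, if it asked for `path`.
    pub fn fingerprint_of(&self, path: &str) -> Option<u64> {
        self.seen.get(path).copied()
    }
}

impl SourceProvider for RecordingProvider {
    fn read(&mut self, path: &str) -> Option<&[u8]> {
        let bytes = self.files.get(path)?;
        // The hash is taken from the same buffer that is returned, so it
        // describes what the caller read by construction.
        self.seen.insert(path.to_string(), fingerprint(bytes));
        Some(bytes.as_slice())
    }
}
```

The same provider could be backed by an extracted directory, a git checkout, or an archive, which is the flexibility the sans-IO approach is after.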

Side note for entertainment purposes only

If you stretch the definition of "single" a lot, you can already have a single file crate on Windows using file streams (i.e. having multiple data streams on the same file). Annoyingly, however, the file has to be called "Cargo.toml" because Cargo strongly enforces this and there doesn't appear to be a way to override it.

Thank you @Eh2406! I remembered there was another potential application of virtual file systems but forgot what it was.

Does putting the .cargo dir on a Dev Drive solve the problem, or is that still slower than Linux?

Well, how about setting files and directories to read-only? Filesystems DO support this stuff...

Yes, this has been investigated off and on with various problems: rust-lang/cargo issue #9455, "Consider making the `src` cache read-only".

It would be a huge step forward if rustc could read a crate file. It would be even better if Rustc could ask cargo for a buffer containing the contents of a particular file, and leave the file extraction to cargo.

I understand in some companies they use build systems other than cargo, so adding a rustc -> cargo dependency doesn't seem like a good idea.

It can be a generic interface which can be used by everyone invoking rustc. Rust-analyzer could also benefit from it a lot for doing cargo check without having to save files.

This is a use case that we cater to and wouldn't want to stop supporting. Having rustc learn to optionally read a .crate file is one avenue; having cargo pass in a byte buffer with the contents of the entire crate to rustc is another (more likely) one. The latter would add no extra dependency on cargo: it'd be just another way of passing information from a build system to rustc without having to hit the disk. This could be implemented with named pipes, sockets, or just plain ol' stdin. Either way, buck and bazel could do the same as cargo, and for end users there would be no difference.
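As a rough illustration of the stdin/pipe variant, here is one way a build system could frame (path, contents) pairs on a byte stream. The wire format (little-endian length prefixes) and the `write_file`/`read_file` names are invented for this sketch; no such protocol exists in rustc.

```rust
use std::io::{self, Read, Write};

/// Writes one file as: u32 path length, path bytes, u64 content length,
/// content bytes. All lengths are little-endian.
pub fn write_file<W: Write>(out: &mut W, path: &str, contents: &[u8]) -> io::Result<()> {
    out.write_all(&(path.len() as u32).to_le_bytes())?;
    out.write_all(path.as_bytes())?;
    out.write_all(&(contents.len() as u64).to_le_bytes())?;
    out.write_all(contents)
}

/// Reads the next file from the stream. Hitting EOF while reading the path
/// length is treated as the end of the stream and returns `Ok(None)`.
pub fn read_file<R: Read>(input: &mut R) -> io::Result<Option<(String, Vec<u8>)>> {
    let mut path_len = [0u8; 4];
    match input.read_exact(&mut path_len) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut path = vec![0u8; u32::from_le_bytes(path_len) as usize];
    input.read_exact(&mut path)?;
    let path = String::from_utf8(path)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    let mut content_len = [0u8; 8];
    input.read_exact(&mut content_len)?;
    let mut contents = vec![0u8; u64::from_le_bytes(content_len) as usize];
    input.read_exact(&mut contents)?;
    Ok(Some((path, contents)))
}
```

Because both functions only need `Write` and `Read`, the same framing would work over a named pipe, a socket, or stdin, and any build system could produce it.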

There's also a conversation about "rust scripts"/"single file rust projects", and we could make the argument that that use-case could/should be supported by this machinery (not sure I agree with it, but you could make that argument).

.crate is a gzipped tarball, so it doesn't support seeking well. Rustc will need to read source files in whatever order modules want them, so it'll probably end up buffering the files somewhere.

ungzip is not free. Compilation is already CPU-bound, so it may be costly.

crate files on crates.io sometimes contain big useless files (the target dir, or unit test fixtures), which in a tarball can't be cheaply skipped over.

Some kind of "jar" could be useful. Maybe even crates.io could use an improved file format. The current crate tarball format isn't well suited for this.

Yeah, it would have to be an uncompressed format. That would enable cheap seeking and also mmap (if needed). On CoW filesystems, with some careful alignment and sparse file trickery, it might even be possible to share file blocks between the archive and the unpacked directory tree.
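A small sketch of what cheap seeking could look like, assuming the archive is a plain uncompressed tar held in memory (or mmapped): one pass over the 512-byte headers builds an index from member name to byte range, after which any file can be sliced out directly. `Span` and `index_tar` are illustrative names, and the sketch only handles short names in the basic header, ignoring the ustar prefix field and GNU long-name extensions.

```rust
use std::collections::HashMap;

const BLOCK: usize = 512;

/// Byte range of one member's contents inside the archive buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub offset: usize,
    pub len: usize,
}

/// Walks the headers of an uncompressed tar buffer once and records where
/// each member's data lives. Returns `None` on a malformed header.
pub fn index_tar(archive: &[u8]) -> Option<HashMap<String, Span>> {
    let mut index = HashMap::new();
    let mut pos = 0;
    while pos + BLOCK <= archive.len() {
        let header = &archive[pos..pos + BLOCK];
        // An all-zero block marks the end of the archive.
        if header.iter().all(|&b| b == 0) {
            break;
        }
        let name = field_str(&header[0..100])?;
        // The size field is ASCII octal.
        let size = field_str(&header[124..136])?;
        let len = usize::from_str_radix(size.trim(), 8).ok()?;
        let offset = pos + BLOCK;
        if offset.checked_add(len)? > archive.len() {
            return None;
        }
        index.insert(name.to_string(), Span { offset, len });
        // Member data is padded up to the next 512-byte boundary.
        pos = offset + (len + BLOCK - 1) / BLOCK * BLOCK;
    }
    Some(index)
}

/// Interprets a NUL-terminated header field as UTF-8 text.
fn field_str(field: &[u8]) -> Option<&str> {
    let end = field.iter().position(|&b| b == 0).unwrap_or(field.len());
    std::str::from_utf8(&field[..end]).ok()
}
```

With a gzipped tarball the same walk is only possible after decompressing everything up to the member of interest, which is the cost being discussed here.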

For what it's worth, it's not strictly necessary to go uncompressed to skip over files. One option is .zip, where you can cheaply skip over unwanted files because files are compressed independently (at the cost of compression ratio). There are also variants of zstd (part of upstream repo) and xz that let you truly seek anywhere.
