Stumbled upon this blog post today, and it immediately made me wonder whether applying this to Rust tooling would yield a significant benefit to compile times and, more importantly, rust-analyzer startup times.
At a glance, it feels like it would, given how common it is for the total number of direct+transitive dependencies to be in the hundreds.
The blog post makes it sound like the gain came from the .tar part of the .tar.gz, rather than the .gz part – the overhead was due to the number of files that needed to be opened, rather than the number of bytes that needed to be read, so having one file with them all concatenated reduced I/O overhead because the kernel could copy them all in one chunk. (Using uncompressed .tar might potentially also make an mmap approach faster than a read approach – mmap is generally slow when working with lots of small files, but using a single .tar gives you one large file.)
I wouldn't expect it to have that much impact on a Rust compile, though – the change improved the speed of the lexer by 2.3×, but lexers are already very fast and would be only a very small non-bottlenecking part of a typical Rust compile. I imagine the total impact taking into account slower parts of the compile would be minimal (probably even with check builds, even though they skip the slowest parts).
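To make the single-mmap idea above concrete, here's a rough, unbenchmarked sketch of walking one uncompressed .tar with a single open and a single mmap, using the memmap2 and tar crates; the archive path is made up:

```rust
use std::{fs::File, io::Cursor};

use memmap2::Mmap;
use tar::Archive;

fn main() -> std::io::Result<()> {
    // One open() and one mmap() instead of an open()/read()/close() per source file.
    let file = File::open("registry/src/some-crate-1.0.0.tar")?; // hypothetical path
    let map = unsafe { Mmap::map(&file)? };

    // The tar crate only needs a `Read`, so a Cursor over the mapped bytes is enough.
    let mut archive = Archive::new(Cursor::new(&map[..]));
    for entry in archive.entries()? {
        let entry = entry?;
        println!("{} ({} bytes)", entry.path()?.display(), entry.header().size()?);
    }
    Ok(())
}
```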
Yeah, gzip is slow enough that if you're just going for raw speed it's probably not worth it. That's where something like LZ4 comes in -- though of course doing that in the filesystem instead is often better than making the application layer deal with it.
But of course this isn't really anything new. This is why games -- especially on Windows -- have used custom archive formats for assets for 30+ years.
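For what it's worth, trying LZ4 from Rust is only a few lines with a crate like lz4_flex (the crate choice here is my assumption, not something from the article):

```rust
// lz4_flex is one pure-Rust LZ4 implementation; the "prepend size" block API
// stores the uncompressed length so decompression knows how much to allocate.
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

fn main() {
    let input = std::fs::read("src/lib.rs").expect("read input"); // any file will do
    let compressed = compress_prepend_size(&input);
    let roundtrip = decompress_size_prepended(&compressed).expect("decompress");
    assert_eq!(input, roundtrip);
    println!("{} -> {} bytes", input.len(), compressed.len());
}
```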
I should probably edit the title of the thread to avoid giving the impression that I'm actually suggesting gzip. Though if that were an option, I know it would see some use for the space savings.
On many file systems you can enable transparent compression on the file system level. BTRFS supports this (with selectable compression algorithm, including zstd), as does NTFS on Windows (with a single fixed algorithm). I think ZFS might support it too, not sure. I have no clue about MacOS X.
This is generally better than handling compression at the application level, as the data is decompressed into the page cache, which means mmap still works and multiple programs can share the same uncompressed pages.
Also: if you are going to use archives (primarily useful on Windows and maybe macOS; Linux punishes you much less for many small files), consider something akin to zip instead of tar. Tar doesn't have random access (even when uncompressed). There is no directory listing in it; it is more like a linked list of records.
Zip on the other hand does have a directory listing (though it is at the end of the zip file and a bit weird). I believe each file is compressed individually, which allows for random access as well.
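As a sketch of what that random access looks like in practice (using the zip crate; the archive and entry names are invented):

```rust
use std::{fs::File, io::Read};

use zip::ZipArchive;

fn main() -> zip::result::ZipResult<()> {
    // Reading the central directory at the end of the file is enough to know
    // where every entry lives; no need to scan the whole archive like with tar.
    let mut archive = ZipArchive::new(File::open("some-crate-1.0.0.zip")?)?;

    // Jump straight to one member and decompress only that member.
    let mut entry = archive.by_name("src/lib.rs")?;
    let mut source = String::new();
    entry.read_to_string(&mut source)?;
    println!("src/lib.rs is {} bytes", source.len());
    Ok(())
}
```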
Yup, that's how I know of LZ4 -- it's the default compression in ZFS because it's so fast that it's essentially always a win, whereas stronger compression isn't necessarily worth it depending on what you're usually storing in the pool.
Interesting discussion; it doesn't look like anyone touched on the idea that this could be particularly useful for rust-analyzer.
Also, the discussion of the decompression impact has a blind spot: there is a middle ground between a bunch of loose files and decompressing in rustc. Since a .crate is a gzipped archive, the decompression could happen at cargo fetch time, and other tooling would then work with contiguous, already-decompressed .crate files.
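A sketch of that middle ground: decompress once at fetch time so everything downstream only ever sees a contiguous, uncompressed archive (flate2 is my choice of gzip crate here, and the file names are hypothetical):

```rust
use std::fs::File;

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // Done once per crate, at download time...
    let mut gz = GzDecoder::new(File::open("some-crate-1.0.0.crate")?);
    let mut tar = File::create("some-crate-1.0.0.tar")?;

    // ...so rustc / cargo / rust-analyzer only touch the plain tar afterwards.
    std::io::copy(&mut gz, &mut tar)?;
    Ok(())
}
```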
I would like to see benchmarks of this (across all the major host platforms used for development). And consider the downsides too. Currently RA can (recursively) navigate to any code in a dependency or std when I ctrl-click in vscode. Would that still be possible in a zip file? I don't know that all the popular editors are able to seamlessly browse into archives.
I consider this incredibly powerful, and it is something I miss when working in C++, where I get to a header and that is it for system dependencies. I would not consider it worth giving this up for any speed advantage whatsoever.
I could imagine a feature where it just materializes a temporary file only when you try to go to definition for some library. With editor support it may not even need to create a file on disk but could just load the contents into a buffer (to use neovim terms). Seems like something that could be part of the LSP protocol.
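Something like the following is what I mean by materializing on demand (the zip crate again; every name and path here is made up, and a real implementation would want caching and collision-free temp names):

```rust
use std::{fs, fs::File, io::Read, path::PathBuf};

use zip::ZipArchive;

/// Pull a single member out of a dependency archive into a temp file and hand
/// back its path, e.g. as the target of a go-to-definition request.
fn materialize(archive_path: &str, member: &str) -> zip::result::ZipResult<PathBuf> {
    let mut archive = ZipArchive::new(File::open(archive_path)?)?;
    let mut entry = archive.by_name(member)?;

    let mut contents = Vec::new();
    entry.read_to_end(&mut contents)?;

    // Flatten "src/lib.rs" into a single temp file name; purely illustrative.
    let out = std::env::temp_dir().join(member.replace('/', "_"));
    fs::write(&out, contents)?;
    Ok(out)
}

fn main() -> zip::result::ZipResult<()> {
    let path = materialize("some-crate-1.0.0.zip", "src/lib.rs")?;
    println!("materialized to {}", path.display());
    Ok(())
}
```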
rustc is so slow that I really doubt this would have any meaningful effect on rustc compile times. For r-a it seems more plausible, but I have never seen an r-a profile so I don't know much about it. Could be worth measuring anyway if anyone is interested.
r-a loads all source files in the workspace into memory at startup; after that it generally gets changes directly from the editor rather than from the filesystem, since only the editor knows the still-unsaved state that r-a needs to operate on.
While I currently use vscode, does this work for other editors such as zed and helix? (I'm considering switching to one of those to get away from the constant Copilot pushing in vscode these days.)
My point is that there are a lot of different editors that LSPs need to work with nowadays.
Is disabling all the copilot stuff and uninstalling the copilot extension sufficient? There’s a “Disable AI Features” setting.
(Copilot stuff getting in my way was driving me mad, almost pushed me away from VS Code, until I figured out how to disable it. Hopefully they don’t try to push it harder by making it impossible to disable… I’d hope that someone involved realizes that that’d be a dumb decision.)
I actually like the smarter tab completion; it is all the other stuff, like agent mode, that I dislike. You can disable it, until an upgrade adds some new variant that you then need to disable separately.
Additionally, they broke font rendering in recent versions (zed is unfortunately still quite broken as well), so I'm looking at migrating to something that properly handles bitmap fonts. Helix in a sensible terminal is one possibility; I'm not sure yet.
It does not for me; I use either full hinting with greyscale AA or even bitmap fonts with no AA. Any text that is even slightly blurry causes me headaches within tens of minutes. And the colour bleed from subpixel AA is even worse.