Source Data
Let’s discuss the most basic thing on top of which everything else is built: source code.
First of all, I want to mention that the obvious idea of “a single Cargo workspace defines the sources” does not work that well. People do open unrelated projects in opposite corners of the hard drive and expect “find usages” to work transparently across them. There’s also the curious case of Rust files which are not actually part of a workspace, because the mod declaration is missing.
So a model of the source code should be closer to a “set of root folders in various places of the file system”.
The next unpleasant thing is that file systems are complicated: file names can use different encodings, there can be link cycles, reading a file can fail for many reasons, and the state of the file system can change under your feet.
On the bright side, source code is text written mostly by humans; it’s not Big Data. The total size of all Rust files in my ~/.cargo is only 35 megabytes. That most certainly can fit into memory.
With all that in mind, a reasonable model of the source data seems to be a set of source roots, where each source root is an in-memory idealized file tree: basically, a HashMap<RelativePath, String>. It’s important to keep paths relative and to store the root directory of each source root elsewhere: that way, we guarantee that the compiler doesn’t do IO bypassing the source roots, and we also handle cases where paths contain non-unicode components. This model should have methods to update it. So, something like this:
use std::collections::HashMap;

type RelativePath = String;
type FileTree = HashMap<RelativePath, String>;
type RootId = usize;

struct Sources {
    roots: Vec<FileTree>,
}

impl Sources {
    fn new() -> Self;
    fn add_file_tree(&self, tree: FileTree) -> (RootId, Self);
    fn change_file_tree(&self, root: RootId, tree: FileTree) -> Self;
    fn get_text(&self, root: RootId, path: &RelativePath) -> &str;
}
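To illustrate the intended persistent flavor of this API (the file path and contents here are hypothetical), every update produces a new snapshot instead of mutating the old one:

// Hypothetical usage: each update returns a fresh `Sources`
// snapshot, so old snapshots stay valid for in-flight analysis.
let sources = Sources::new();

let mut tree = FileTree::new();
tree.insert("src/lib.rs".to_string(), "pub mod foo;".to_string());

let (root, sources) = sources.add_file_tree(tree);
assert_eq!(sources.get_text(root, &"src/lib.rs".to_string()), "pub mod foo;");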
The Sources then becomes the input to the compiler, which doesn’t do any IO itself. For the language-server use case, the sources are maintained by merging the state of the real file system with editor overlays (the contents of files which are open in the editor but not yet saved to disk).
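A minimal sketch of that merging, assuming the overlays are just a map from path to the unsaved buffer contents:

// A sketch of merging on-disk state with editor overlays.
// Open, unsaved editor buffers shadow the files on disk.
fn merge(on_disk: &FileTree, overlays: &FileTree) -> FileTree {
    let mut tree = on_disk.clone();
    for (path, text) in overlays {
        tree.insert(path.clone(), text.clone());
    }
    tree
}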
For the command-line compiler, loading all files upfront does not seem like a good idea. However, I think we can both implement on-demand loading from the file system and keep io::Error out of the API, by returning empty content in case of an error, stashing the error somewhere, and reporting all such errors at the end of the compilation.
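Here is one way this could look (a sketch; the LazyRoot type and its fields are made up for illustration):

use std::{cell::RefCell, fs, io, path::PathBuf};

// A hypothetical lazily-loaded source root: it reads files on first
// access and swallows IO errors into a side table instead of the API.
struct LazyRoot {
    base_dir: PathBuf,
    errors: RefCell<Vec<(RelativePath, io::Error)>>,
}

impl LazyRoot {
    fn get_text(&self, path: &RelativePath) -> String {
        match fs::read_to_string(self.base_dir.join(path)) {
            Ok(text) => text,
            Err(err) => {
                // Pretend the file is empty; report the error at the
                // end of compilation instead of failing the read.
                self.errors.borrow_mut().push((path.clone(), err));
                String::new()
            }
        }
    }
}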
This model is workable (almost: see below for proc-macros) in the sense that everything in Rust is distributed via source code, so we don’t need additional information. However, an important optimization would be to use pre-compiled crates: for Google-scale codebases, it is important to be able to download pre-analyzed dependencies from the build server instead of analyzing everything from source on the local machine. So, we also need
type CrateId = usize;

struct CrateMeta { ... } // something spiritually similar to save-analysis

impl Sources {
    fn add_crate_meta(&self, krate: CrateId, meta: CrateMeta) -> Self;
}
And I think the last missing piece of the puzzle here is proc-macros. We need to be careful here! Consider a case where we work on two crates: one is a proc-macro, and the other uses the first one. I think it would be a misfeature if modifying the first crate resulted in an immediate recompilation of the second crate. In theory that would be great, but in practice it would mean compiling too much stuff only to figure out that the user hasn’t finished with the proc macro yet. The better user experience here would be to require an explicit, user-initiated full build of the proc-macro crate for changes to take effect in the downstream crate. That means that proc-macros are provided to the compiler as inputs, and are not derived directly from the sources, which gives us the following API:
impl Sources {
    fn add_proc_macro(&self, expander: Box<dyn Fn(TokenStream) -> TokenStream>) -> Self;
}
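For illustration (the identity expander here is made up, and TokenStream is the token-stream type of whatever macro ABI the compiler settles on), registering a pre-built macro could look like this:

// Hypothetical usage: the expander comes from an explicit,
// user-initiated build of the proc-macro crate, not from `Sources`.
let expander: Box<dyn Fn(TokenStream) -> TokenStream> =
    Box::new(|input| input); // identity expansion, for illustration
let sources = sources.add_proc_macro(expander);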