Next steps for RLS, Rustc, and IDE support


#1

Over in the 2019 Strategy for Rustc and the RLS thread, I wrote that we should break out a thread to discuss strategies for IDE support.

I want to start by giving a big shoutout to all the folks who’ve contributed to the RLS. We’ve come really far very fast and it’s great to see.

It seems to me though that we are now at a bit of an impasse. In particular, to make big strides forward, we really need a compiler to start lifting more of the weight.

I had always hoped this would be rustc, using queries for low latency and incremental compilation to avoid doing too much work. I still think that, long term, this is a laudable goal and probably achievable. On the other hand, I worry that it is not something we’re going to realistically have working in the short term.

One other thing I think is worth talking about is apportioning responsibility. I sort of feel like the compiler team has been shirking some of its responsibility, and that we ought to make RLS support more of a “first class” concern, though I’m not sure what shape that takes: it might be stronger API boundaries, more tests, or it might be kind of merging the RLS + rustc teams together into something more unified.

Discuss. =)


#2

Following up on my opening comment (emphasis added):

This is why I am intrigued by @matklad’s ideas around libsyntax2. I can see the appeal of starting a “from scratch” replacement focused on the front end part of the compiler, with the eventual goal of it being adopted as the “master parser” for not only the RLS but also the compiler, syntax extensions, etc. I think we’d have to plan carefully, though, to ensure that it doesn’t “drift on its own” for too long.

In my ideal world, we would plan out an end-point that we want to reach – an end-point with a single unified compiler working in a particular way – and a path to get there, such that while we initially are maintaining both old + new AST / front-end, we have a timeline and plan to merge them. It’s just that it might not make sense to do that too early: perhaps it’s better to prove out the new design in the RLS first.

(As for things like macro expansion, name resolution, and even type-checking, in my ideal world, we would be re-implementing them as separate libraries that can be shared, with an emphasis on incremental and so forth. I’ve not thought hard about this part.)

The danger of course is an increased maintenance burden. On the other hand, we are currently supporting both rustc and racer, to some extent (the latter of which is, as I understand it, a kind of independent re-implementation of various parts of rustc for the purpose of doing auto-completion). (I don’t know much about racer, but my assumption is that it was never designed with the goal of building into a full compiler, and hence wouldn’t serve as a good starting point for this sort of work.)


#3

That… is a brilliant observation that we’ve completely missed in that other thread :slight_smile: (and some folks, cough, cough, me, just jumped straight to suggesting solutions). Let’s draw an ideal compiler architecture in some level of detail, and then plan an optimal path towards it? Optimality should consider at least two metrics, probably more:

  • total amount of work
  • time to market (i.e., how fast we can provide incomplete, but reliable and fast, completion suggestions from the compiler).

I think we all agree that we want to see a set of libraries in the end, but that is not too actionable: all software consists of components.

Perhaps it makes sense to focus on core data structures and data flow instead?

Roslyn Overview provides a good example of what it might look like.


#4

Indeed I’ve been thinking that we really ought to be studying Roslyn more closely.


#5

And not only Roslyn.

dartanalyzer is also a very interesting project to study. In particular, its server API could be a model for what a high-level compiler API should be capable of. It includes stuff like asynchronous completions, asynchronous searches, running tasks and tests, analysis subscriptions and file system overlays: all those important things that LSP misses.

Kotlin is also a great example of a compiler-IDE project. Sadly, they don’t seem to have any public compiler architecture docs, and I know only scattered bits and pieces.


#6

Source Data

Let’s discuss the most basic thing on top of which everything else is built: source code.

First of all, I want to mention that the obvious idea of “a single Cargo workspace defines the sources” does not work that well. People do open unrelated projects in opposite corners of the hard drive and expect “find usages” to work transparently across them. There’s also the curious case of Rust files which are not actually part of a workspace because the mod declaration is missing.

So a model of the source code should be closer to a “set of root folders in various places of the file system”.

The next unpleasant thing is that file systems are complicated: filenames could use different encodings, there could be link cycles, reading a file might fail for so many reasons, and the state of the file system can change under your feet.

On the bright side, source code is text written mostly by humans; it’s not Big Data. The total size of all Rust files in my ~/.cargo is only 35 megabytes. That most certainly can fit into memory.

With all that in mind, a reasonable model of the source data seems to be a set of source roots, where each source root is an idealized in-memory file tree, basically a HashMap<RelativePath, String>. It’s important to keep paths relative and store the root directory of each source root elsewhere: that way, we guarantee that the compiler doesn’t do IO that bypasses the source roots, and we also handle cases where paths contain non-Unicode components. This model should have methods to update it. So, something like this:

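use std::collections::HashMap;

type RelativePath = String;
type FileTree = HashMap<RelativePath, String>;
type RootId = usize;

struct Sources {
    roots: Vec<FileTree>,
}

// API sketch: signatures only, bodies omitted.
impl Sources {
    fn new() -> Self;
    // Updates take &self and return a new value; the old one stays usable.
    fn add_file_tree(&self, tree: FileTree) -> (RootId, Self);
    fn change_file_tree(&self, root: RootId, tree: FileTree) -> Self;
    fn get_text(&self, root: RootId, path: RelativePath) -> &str;
}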

The Sources then becomes the input to the compiler, which doesn’t do any IO itself. For the language server use case, the sources are maintained by merging the state of the real file system with editor overlays.
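
To make the snapshot-plus-overlays idea concrete, here is a minimal sketch. It is deliberately naive: every update clones the whole state, whereas a real implementation would presumably share unchanged trees between snapshots. The merge_overlays helper and the Option return type of get_text are my assumptions for illustration, not part of the API proposed above.

```rust
use std::collections::HashMap;

type RelativePath = String;
type FileTree = HashMap<RelativePath, String>;
type RootId = usize;

/// Immutable snapshot of all source roots.
/// Naive sketch: each update clones everything.
#[derive(Clone, Default)]
struct Sources {
    roots: Vec<FileTree>,
}

impl Sources {
    fn new() -> Self {
        Self::default()
    }

    /// Returns the id of the new root together with the new snapshot.
    fn add_file_tree(&self, tree: FileTree) -> (RootId, Self) {
        let mut next = self.clone();
        next.roots.push(tree);
        (next.roots.len() - 1, next)
    }

    /// Replaces one root; `self` is left untouched.
    fn change_file_tree(&self, root: RootId, tree: FileTree) -> Self {
        let mut next = self.clone();
        next.roots[root] = tree;
        next
    }

    /// Assumption: an unknown root or path yields `None`.
    fn get_text(&self, root: RootId, path: &str) -> Option<&str> {
        self.roots.get(root)?.get(path).map(String::as_str)
    }
}

/// Hypothetical helper: unsaved editor buffers win over on-disk contents.
fn merge_overlays(on_disk: &FileTree, overlays: &FileTree) -> FileTree {
    let mut merged = on_disk.clone();
    merged.extend(overlays.iter().map(|(path, text)| (path.clone(), text.clone())));
    merged
}
```

A language server would call change_file_tree with the merged tree on every edit, while analyses already running against the previous snapshot keep seeing a consistent state.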

For the command-line compiler, loading all files upfront does not seem like a good idea. However, I think we can both implement on-demand loading from the file system and keep io::Error out of the API by returning empty content in case of errors, stashing the errors somewhere, and reporting them at the end of the compilation.

This model is workable (almost: see proc-macros later on) in the sense that everything in Rust is distributed as source code, so we don’t need additional information. However, an important optimization would be to use pre-compiled crates: for Google-scale codebases it is important to be able to download pre-analyzed dependencies from the build server instead of analyzing everything from source on the local machine. So, we also need
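
A rough sketch of what such on-demand loading could look like. LazyRoot and its methods are hypothetical names invented for this example; the only std calls used are fs::read_to_string and std::mem::take.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical lazily loaded source root.
struct LazyRoot {
    root_dir: PathBuf,
    cache: HashMap<String, String>,
    errors: Vec<(String, io::Error)>,
}

impl LazyRoot {
    fn new(root_dir: impl Into<PathBuf>) -> Self {
        LazyRoot {
            root_dir: root_dir.into(),
            cache: HashMap::new(),
            errors: Vec::new(),
        }
    }

    /// Reads the file on first access. A failed read is recorded and the
    /// file is treated as empty, so callers never see an io::Error.
    fn get_text(&mut self, rel_path: &str) -> &str {
        if !self.cache.contains_key(rel_path) {
            let text = match fs::read_to_string(self.root_dir.join(rel_path)) {
                Ok(text) => text,
                Err(err) => {
                    self.errors.push((rel_path.to_string(), err));
                    String::new()
                }
            };
            self.cache.insert(rel_path.to_string(), text);
        }
        &self.cache[rel_path]
    }

    /// Drains the stashed errors, to be reported at the end of compilation.
    fn take_errors(&mut self) -> Vec<(String, io::Error)> {
        std::mem::take(&mut self.errors)
    }
}
```

Because the empty result is cached, a broken file is reported once, not once per lookup.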

struct CrateMeta { ... } // something spiritually similar to save-analysis
impl Sources {
    fn add_crate_meta(&self, krate: CrateId, meta: CrateMeta) -> Self;
}

And I think the last missing piece of the puzzle here is proc-macros. We need to be careful here! Consider a case where we work on two crates: one is a proc-macro, and the other uses it. I think it would be a misfeature if modifying the first crate resulted in an immediate recompilation of the second crate. In theory, that would be great, but in practice I think this would result in compiling too much stuff only to figure out that the user hasn’t finished with the proc macro yet. I think the better user experience here would be to require an explicit user-initiated full build of the proc macro crate for changes to take effect on the downstream crate. That means that proc-macros are provided to the compiler as inputs, and not directly derived from sources, which gives us the following API:

impl Sources {
    fn add_proc_macro(&self, expander: Box<dyn Fn(TokenStream) -> TokenStream>) -> Self;
}

#7

Note that the Rust compiler already does something like this, e.g. the source_map::FileLoader trait.
(what should we do about accidentally designing something from scratch, due to not having fully studied the current architecture? there are serious risks of reinvention here - we should get more people involved, at least)

While making everything source-based and relying on incremental state instead of rlibs would actually be kind of amazing, it’s not a short-term priority.
It’s what @nikomatsakis calls “multi-crate sessions”.

I think “crate” separation is, sadly, suboptimal in several respects, despite having served us really well given our tradeoffs.
But it will take some time to shake that off.


#8

Definitely agree that compiler & IDE people should work closely together to avoid needless duplication. However, the specific case of FileLoader signifies that I failed to get a point across in my previous post, which aimed to highlight the differences from the current FileLoader setup, of which I am aware :slight_smile:

The main thing is having an explicit immutable value for the state of the world to which you can apply changes (the design stolen from Roslyn). The current FileLoader gives a specific snapshot, which is all you need for a command-line compiler, but which just doesn’t have an API to talk about changes.

The second important thing is removing IO (and io::Result) from the API: for a long-lived process which deals with values changing over time, IO seems to be a liability.

The third nice but probably not crucial thing is replacing OS-specific absolute paths with something more structured.


#9

Syntax Model

The second important data structure is the syntax model. I won’t go into details here, but https://github.com/dotnet/roslyn/wiki/Roslyn-Overview#working-with-syntax and https://github.com/apple/swift/tree/d2560e7513460d7e12181256b583fef76d587962/lib/Syntax give a good overview of constraints and possible solutions.

What is sort-of unique to Rust, though, is that it might be hard to map from syntax to semantics. In C# (I might be wrong here), you can map a source file to a single compilation unit from the semantic model.

In Rust, the same source might appear several times in the semantic model, for the following reasons:

  • a single file is included as a submodule by two different modules (mod foo; #[path="fo.rs"] mod bar;)
  • a single file is included as submodule by several crates (utils.rs shared by both lib.rs and main.rs)
  • a single crate is present in the compilation graph several times, with different cfg flags
  • a macro might duplicate its argument.

From the compiler’s point of view, this is not a problem: it processes sources top-down and always sees them as a tree, even if they in fact form a DAG.

For an IDE, the situation is different. Let’s see how goto definition might work. Initially, you have a path to a file and an offset. Using this info, you can unambiguously retrieve a syntax model for the file, and find a reference node (an identifier, most likely) at the offset. Now, you need to map this identifier to a semantic model, and that is a problem, because the mapping could be ambiguous.

I think in practice this situation is rare, and just picking the first semantic object that fits would be OK. But because that would be a lossy API, it would be important to mark it as // XXX: might lose information, use only inside IDE, usages inside the compiler are forbidden.
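
One way to picture the lossy API next to a precise one. FileId, ModuleId and ModuleMap are made-up names for this sketch; nothing here corresponds to an existing rustc or RLS type.

```rust
use std::collections::HashMap;

/// Hypothetical identifiers, invented for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct FileId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ModuleId(u32);

/// Records every semantic module a source file ends up in.
#[derive(Default)]
struct ModuleMap {
    owners: HashMap<FileId, Vec<ModuleId>>,
}

impl ModuleMap {
    fn record(&mut self, file: FileId, module: ModuleId) {
        self.owners.entry(file).or_default().push(module);
    }

    /// Precise: all modules the file appears as.
    fn modules_for(&self, file: FileId) -> &[ModuleId] {
        match self.owners.get(&file) {
            Some(modules) => modules.as_slice(),
            None => &[],
        }
    }

    /// XXX: might lose information, use only inside IDE,
    /// usages inside the compiler are forbidden.
    fn any_module_for(&self, file: FileId) -> Option<ModuleId> {
        self.modules_for(file).first().copied()
    }
}
```

Keeping the precise accessor around means the lossy one is a convenience, not the only way in.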

EDIT: the mapping problem can be seen in this Roslyn sample. The var nameInfo = model.GetSymbolInfo(root.Usings[0].Name); line is what is problematic for Rust.


#10

I feel like more tests are always good :slight_smile:


#11

When you say “the current architecture”, do you mean the current architecture of rustc/RLS, or do you mean other compilers (e.g., Kotlin, Roslyn, etc)?


#12

I specifically meant rustc.


#13

OK. I guess I’m less worried about that, as long as we have participation from members of the compiler and RLS teams here. =)


#14

Semantic Model

To be able to talk about the semantic model, we need to know the project model: how source files are organized into crates, and what the dependencies between them are. I think a precise model here is a directed graph, where nodes are crates (path to the crate root file + a set of cfg flags) and edges are dependencies (so each edge is marked with an extern crate name). That is, a single source file (like lib.rs) may be a crate root for several crates with different cfg flags.

One approach to solving the syntax/semantic model mismatch is to have an explicit “expand sources” step in generating the semantic model. Expand sources takes the project model and the set of source files and generates, for each crate, a module tree with all macros expanded. The expansion step effectively duplicates some source files.
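
A small sketch of that graph, to pin down the shape. CrateGraph, CrateData and the method names are invented for illustration; cfg flags are modelled as plain strings, which is a simplification.

```rust
use std::collections::BTreeSet;

type CrateId = usize;

/// A node: crate root file plus the cfg flags it is compiled with.
struct CrateData {
    root_file: String,
    cfg: BTreeSet<String>,
    /// Outgoing edges, each labelled with the extern crate name.
    deps: Vec<(String, CrateId)>,
}

#[derive(Default)]
struct CrateGraph {
    crates: Vec<CrateData>,
}

impl CrateGraph {
    fn add_crate(&mut self, root_file: &str, cfg: &[&str]) -> CrateId {
        self.crates.push(CrateData {
            root_file: root_file.to_string(),
            cfg: cfg.iter().map(|flag| flag.to_string()).collect(),
            deps: Vec::new(),
        });
        self.crates.len() - 1
    }

    fn add_dep(&mut self, from: CrateId, extern_name: &str, to: CrateId) {
        self.crates[from].deps.push((extern_name.to_string(), to));
    }

    /// The same root file may back several nodes with different cfgs.
    fn crates_rooted_at(&self, root_file: &str) -> Vec<CrateId> {
        self.crates
            .iter()
            .enumerate()
            .filter(|(_, data)| data.root_file == root_file)
            .map(|(id, _)| id)
            .collect()
    }
}
```

For example, adding src/lib.rs once with no flags and once with the test flag yields two distinct nodes that share a root file.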

As a result of this step, we get

  • expanded source model, which consists of syntax trees, but has one to one correspondence with semantic model;
  • a one-to-one relation between expanded source and original source;
  • a one-to-many relation between original source and expanded source.

Most IDE features (think goto definition) which take a source position as an input would, as a first step, map it to the set of expanded positions, pick one (more or less arbitrary) expansion, map it to semantic model, process the semantic model, map semantic results back to expanded sources and map expanded sources back to original sources.
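
The round trip might be sketched like this. All types are hypothetical, and the definitions table is only a stand-in for whatever the real semantic model would compute.

```rust
use std::collections::HashMap;

/// Position in an original source file (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct SourcePos {
    file: u32,
    offset: u32,
}

/// Position in the expanded source of one crate (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ExpandedPos {
    krate: u32,
    offset: u32,
}

#[derive(Default)]
struct Analysis {
    /// original -> expanded: one source position may expand many times.
    expansions: HashMap<SourcePos, Vec<ExpandedPos>>,
    /// expanded -> original: each expanded position has one origin.
    origins: HashMap<ExpandedPos, SourcePos>,
    /// Stand-in for the semantic model: reference -> definition.
    definitions: HashMap<ExpandedPos, ExpandedPos>,
}

impl Analysis {
    fn goto_definition(&self, pos: SourcePos) -> Option<SourcePos> {
        // 1. Map to the set of expanded positions and pick the first one.
        let expanded = *self.expansions.get(&pos)?.first()?;
        // 2. Resolve within the semantic model of that expansion.
        let target = *self.definitions.get(&expanded)?;
        // 3. Map the result back to the original source.
        self.origins.get(&target).copied()
    }
}
```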

The semantic model of the expanded source would form most of the API surface, as it needs to include at least symbols (items) and types. OTOH, bidirectional mapping between semantics and expanded sources seems relatively straightforward.


#15

Technically, it would be a many-to-one relation, but, I don’t think that changes your point at all.


#16

Yep, I’ve messed up the terminology here, thanks!


#17

I always thought Rust would be the language where IDEs don’t have to guess at all. The language is so strict that it provides enormous amounts of information known at compile time. There is no need to implement heuristics when you can just say to the user “hey, this file is not a part of the workspace! Don’t you want to add it?” and don’t analyse the file unless they include it properly. Rust users are already used to strictness of the compiler, so IDEs can easily be just as strict. The IDE has no way of knowing what the user really wants to do with the file (maybe it’s just an input to the build script).

I’d love to have an IDE that is never wrong. If it suggests a method, it ought to exist. If it doesn’t suggest a method with a given prefix, then it doesn’t exist. This is the behavior that greatly increases the user’s productivity. When the IDE starts to guess, I start doubting it, and the productivity declines.

In other words, the code completion model must understand the code that will compile, and nothing more. This will work well for simple projects. I can see the challenge for projects with custom build systems, where it’s hard to reproduce the same compiler configuration in both IDE and the actual build process. However, I think it’s more efficient to work on support for these complex cases and provide strict analysis for them too instead of falling back to guessing. If the project builds successfully on a given machine, nothing prevents the IDE from being able to analyse it too (as a last resort, it can catch all rustc calls and figure out what’s going on in the build system).

In an example of two local folders A and B which seem to be unrelated, the IDE can just analyse how A is built. If B is involved in a build process, the IDE should take it into account. And if it isn’t actually used then A can be analysed independently.

I would say that the priorities should be ordered as follows:

  • Support strict analysis of a build process that can be initiated by cargo. If the build results depend on the environment (e.g. build script accesses other files, user changes configuration, env vars, etc.) then the user must set up a locally working build process so that the IDE can understand what’s actually going on.
  • Support custom build systems.
  • Implement optional guessing strategies for random files opened by the user and for cases like methods that are not available on the current platform. The guessed suggestions must be visually distinct from proven suggestions.

#18

Hm, I don’t think “being 100% precise” and “being helpful if the code is incomplete” necessarily contradict each other :slight_smile:

For files which are not part of any crate, the IDE could provide all the analysis it can, show all the errors the compiler can show (basically, every symbol not from an external crate should be unresolved + a special “detached module” error) and provide heuristics-based fixes (“Add mod foo; to mod.rs?”, “Add [[bin]] name="foo" to Cargo.toml?”).

“In other words, the code completion model must understand the code that will compile, and nothing more.”

This will definitely work great for offline code browser cases, like SourceGraph, DXR or Kythe. However, I don’t think this will be the right model for an IDE: 99% of the time, the code won’t compile, because you are typing it in right now, or because you are in the middle of a project-wide refactoring.

I’m not arguing that the IDE should use heuristics: the IDE must be precise, and we are in full agreement here. I’m arguing that the IDE should be resilient to broken code and broken builds, and that it should be fast. Requiring that the IDE run the build before analyzing the code compromises these requirements.


#19

I don’t suggest running it all the time. I suggest running it once to figure out the pipeline (build scripts, locations of the files that are actually being built, etc.) and working from there incrementally. However, it’s unclear how to determine when something has changed enough that a complete re-run is necessary (a change in the code can lead to changes in the build process), so I don’t know if it’s possible to make it fully automatic. This also won’t work as well with broken builds (but the IDE can still infer some information from the compiler errors and suggest possible fixes).

If the compiler and cargo can build the code, I expect the IDE to analyse it successfully no matter how complicated the project is. It should be completely possible if the IDE is sufficiently integrated with the compiler.

There is also a problem with configuration differences between IDE and actual builds. It would be nice if the analysis used the exact same configuration I use for running/debugging. This means that when I switch from cargo test to cargo run the completion results may change. There are more complex cases like cargo run --features ..., and it would be nice if the IDE could analyse the run/debug command I use and configure the analyzer appropriately.


#20

Looks like we have a similar model in mind, albeit framed a little bit differently! Let me unpack my understanding of what is necessary to integrate an IDE with a build system.

First of all, I think it is important to keep build scripts completely out of scope of the code analyzer. That is, the user should themselves run the usual build process, which could use build scripts to generate code at build time. The code analyzer will see the generated code the same way as it sees all other code, just like the command-line compiler. It’s the user’s responsibility to determine when generated code might get stale and when a full build should be re-run.

The “locations of files being built” unpacks to the “graph of crate instances”, where a “crate instance” is a tuple of (root module, set of enabled cfgs, set of dependencies: named crate instances). I don’t think we need to actually “run” the build process to learn this info? It is determined by the static project structure. Moreover, statically we could describe all the various configurations we can have, while dynamic tracing of a build process will give us only a single configuration. If we know about all configurations, we can be smarter about picking “the current one”. For example, by default we can pick a config which corresponds to --all-features + --test, allow the user to tweak this at the Cargo level (use the same cfgs as cargo bench), and allow fine-grained tweaking of per-crate-instance cfg flags.

Determining if this “project structure” info should be regenerated is pretty easy in case of Cargo: you need to watch Cargo.toml, Cargo.lock, and certain well-known auto-detected files, like src/bin/*.rs. Determining which analysis results should be recomputed when source files change should fall out naturally from the incremental query engine.
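
As a rough illustration of the watching rule, here is a predicate deciding whether a changed file should trigger regenerating the project structure. The function name is made up, and it only covers the files named above; Cargo also auto-detects targets in other well-known locations, which a real implementation would need to include.

```rust
use std::path::Path;

/// Hypothetical check: does a change to `path` invalidate the
/// project structure (as opposed to just the analysis results)?
fn affects_project_structure(path: &Path) -> bool {
    let file_name = path.file_name().and_then(|name| name.to_str());
    if matches!(file_name, Some("Cargo.toml") | Some("Cargo.lock")) {
        return true;
    }
    // Auto-detected binary targets: src/bin/*.rs
    let is_rust_file = path.extension().map_or(false, |ext| ext == "rs");
    let in_src_bin = path.parent().map_or(false, |dir| dir.ends_with("src/bin"));
    is_rust_file && in_src_bin
}
```

Everything else, such as an edit to src/lib.rs, would be left to the incremental query engine.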