I have a bunch of thoughts here, although no coherent story.
First, on the rust-analyzer side, we are in the (slow) process of actually designing the "compiler API for 2nd and 3rd party consumers": HIR 2.0 for rust-analyzer - HackMD. Not much to see there at the moment.
Second, I do think that we'll end up exposing a stable-ish, versioned API for language analysis eventually. C# and Dart are two histories we should study and try to repeat. That said, I think that in the skill tree, "compiler API" comes after "there's a single code base which implements both batch compilation and real-time completions". We need a production-ready IDE to understand both:
- what API shape is best for implementing lints and other functionality
- how to implement the API in the most efficient way (so that lints can be shown while you are editing the code, with O(size-of-edit) complexity rather than O(total-project-size) complexity)
Today, we are very far from that state: we have both efficient batch compilation and OK-ish completions, but they are provided by two unrelated lineages of code, with two very different architectures.
Third, we don't need APIs to implement custom lints. It's enough to expose a data representation. We can ask the compiler to dump a crate's AST, annotated with types and whatnot, to a .json file. That would be enough to implement custom lints. By adding a serialization layer, we mostly sidestep the stability constraints. In terms of using data to represent language semantics, this paper is quite interesting, as is Swift's intermediate language.
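To make this concrete, here is a minimal sketch of what a lint over such a dump could look like. The node shape (`kind`, `method`, `ty`, `children`) is entirely invented for illustration; a real lint would deserialize whatever .json format the compiler actually emits.

```rust
// Hypothetical, hand-rolled shape of a type-annotated AST dump.
// A real lint would deserialize the compiler's .json output; the node
// here is built by hand to keep the sketch self-contained.
struct Expr {
    kind: String,           // e.g. "method_call", "path"
    method: Option<String>, // method name, for method calls
    ty: String,             // inferred type, rendered as a string
    children: Vec<Expr>,    // receiver and arguments
}

// A toy lint: flag `.unwrap()` calls whose receiver is an `Option<_>`.
fn lint(expr: &Expr, out: &mut Vec<String>) {
    if expr.kind == "method_call"
        && expr.method.as_deref() == Some("unwrap")
        && expr.children.first().map_or(false, |r| r.ty.starts_with("Option<"))
    {
        out.push(format!("avoid unwrap on {}", expr.children[0].ty));
    }
    for child in &expr.children {
        lint(child, out);
    }
}

// Dump for the expression `x.unwrap()` where `x: Option<i32>`.
fn example_dump() -> Expr {
    Expr {
        kind: "method_call".into(),
        method: Some("unwrap".into()),
        ty: "i32".into(),
        children: vec![Expr {
            kind: "path".into(),
            method: None,
            ty: "Option<i32>".into(),
            children: vec![],
        }],
    }
}

fn main() {
    let mut warnings = Vec::new();
    lint(&example_dump(), &mut warnings);
    assert_eq!(warnings, ["avoid unwrap on Option<i32>"]);
}
```

Note that the lint never talks to the compiler at all: it is a pure function over data, which is exactly what lets the serialization layer absorb the stability concerns.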
Fourth, even an "annotated AST" is hard! The problem is subtle; I'm not sure I can explain it well. The issue has to do with internal representations, lowering, and stability. It's easy to expose just the AST -- the AST is a surface, user-visible thing, and it is nailed down in great detail. We can be reasonably sure that the AST won't change significantly; it is append-only.
However, things like types, lifetime regions, or the control-flow graph are not faithfully representable in the surface language -- the compiler uses private data structures for them. Here, we don't have any kind of stability guarantee. The way the compiler thinks about these things changes over time. For example, regions were originally represented as scopes in the code, but are now a set of points on the control-flow graph.
So, when you try to expose type-checked code as a data structure, you face a tradeoff between:
- using the language of the surface syntax, which is stable but inexpressive, and might be awkward for tools to work with
- using the language of a particular compiler intermediate representation, which is expressive and convenient to work with, but raises significant backwards-compatibility concerns.
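Here is a small sketch of the two encodings for one concrete type, `Option<Vec<u8>>`; the `Ty` shape is invented for illustration and does not match any actual compiler format:

```rust
// IR-style encoding: a structured tree, convenient for tools to query
// (e.g. "is the outermost type an Option?"), but its shape mirrors
// compiler internals and may change between releases.
enum Ty {
    Adt { name: String, args: Vec<Ty> }, // nominal type + generic args
}

// Lowering the structured form back to the stable surface form is easy;
// going the other way requires a full type parser.
fn render(ty: &Ty) -> String {
    match ty {
        Ty::Adt { name, args } if args.is_empty() => name.clone(),
        Ty::Adt { name, args } => {
            let args: Vec<String> = args.iter().map(render).collect();
            format!("{}<{}>", name, args.join(", "))
        }
    }
}

fn main() {
    // Surface-syntax encoding: the type exactly as a user could write it.
    let surface = "Option<Vec<u8>>";

    let ir = Ty::Adt {
        name: "Option".into(),
        args: vec![Ty::Adt {
            name: "Vec".into(),
            args: vec![Ty::Adt { name: "u8".into(), args: vec![] }],
        }],
    };
    assert_eq!(render(&ir), surface);
}
```

The asymmetry is the point: a dump in the structured form is pleasant for lints but couples them to the compiler's view of types, while the string form is stable but pushes parsing work onto every tool.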