Towards a Stable Compiler API and Custom Lints

A bit ahead of the timeframe I'd thrown out there! I have written up my high-level thoughts on a plausible path regarding custom lints and the far-reaching consequences of the stable API required to make it doable. While a lot of work, I sincerely believe this to be possible. I have deliberately left out most technical details, as I do not think they are necessary in an overview like this.

For any clarifications with the text itself (or to request your name be added), please comment on the HackMD (select the relevant text). For discussion, please use this thread. I'll be busy for a little bit, but be assured that I will read every comment within the coming days.

Please do not post this anywhere other than IRLO and Zulip. I am trying to keep this discussion among those that are at least vaguely aware of the implications and challenges rather than having someone new to Rust be thoroughly confused as to what's going on. Plus I might turn this into a blog post at some point.


Thanks for writing up this first draft! I mostly share your vision on this project. I left comments on things that I think need some clarification. I'm definitely excited to work with you on this!

I have a bunch of thoughts here, although no coherent story.

First, on the rust-analyzer side, we are in the (slow) process of actually designing the "compiler API for 2nd and 3rd party consumers": HIR 2.0 for rust-analyzer - HackMD. Not much to see there at the moment.

Second, I do think that we'll end up exposing a stable-ish, versioned API for language analysis eventually. C# and Dart are two histories we should study and try to repeat. That said, I think that in the skill tree, "compiler API" should come after "there's a single code base which implements both batch compilation and real-time completions". We need a production-ready IDE to understand both:

  • what API shape is best to implement lints and other functionality
  • how to implement the API in the most efficient way (so that lints can be shown while you are editing the code with O(size-of-edit) complexity and not with O(total-project-size) complexity)

Today, we are very far from that state: we have both efficient batch compilation and OK-ish completions, but they are provided by two unrelated lineages of code, with two very different architectures.

Third, we don't need APIs to implement custom lints. It's enough to expose a data representation. We can ask the compiler to dump a crate's AST, annotated with types and whatnot, to a .json file. This would be enough to implement custom lints. By adding a serialization layer, we mostly sidestep the stability constraints. In terms of using data to represent language semantics, this paper is quite interesting, as is Swift's intermediate language.
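To make the dump idea concrete, here is a toy sketch of a lint consuming such a serialized, type-annotated AST. Everything here (the `Node` shape, the field names, `lint_needless_str_clone`) is invented for illustration; no such dump format or flag exists today, and a real version would deserialize the .json rather than hand-build the data.

```rust
// Hypothetical shape of one entry in a dumped, type-annotated AST.
// Hand-built here to keep the sketch self-contained; in practice it
// would be deserialized from the compiler's .json output.
#[derive(Debug)]
struct Node {
    kind: &'static str,        // e.g. "Let", "MethodCall"
    method: &'static str,      // method name if `kind == "MethodCall"`, else ""
    receiver_ty: &'static str, // inferred receiver type, as dumped text
    span: (usize, usize),      // byte range in the source file
}

/// Toy lint: flag `.clone()` calls on `&str`, where `.to_owned()` is
/// usually clearer. Note the lint only reads data; it never touches
/// compiler internals, which is what sidesteps the stability problem.
fn lint_needless_str_clone(nodes: &[Node]) -> Vec<(usize, usize)> {
    nodes
        .iter()
        .filter(|n| n.kind == "MethodCall" && n.method == "clone" && n.receiver_ty == "&str")
        .map(|n| n.span)
        .collect()
}

fn main() {
    let dump = vec![
        Node { kind: "Let", method: "", receiver_ty: "String", span: (0, 20) },
        Node { kind: "MethodCall", method: "clone", receiver_ty: "&str", span: (25, 37) },
    ];
    for (lo, hi) in lint_needless_str_clone(&dump) {
        println!("warning: needless `&str` clone at bytes {lo}..{hi}");
    }
}
```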

Fourth, even an "annotated AST" is hard! The problem is subtle; I'm not sure I can explain it well. The issue has to do with internal representations, lowering, and stability. It's easy to expose just the AST: the AST is a surface, user-visible thing, and it is nailed down in great detail. We can be reasonably sure that the AST won't change significantly; it is append-only.

However, things like types, lifetime regions, or the control-flow graph are not faithfully representable in the surface language; the compiler uses compiler-private data structures for them. Here, we don't have any kind of stability guarantee. The way the compiler thinks about these things changes over time. For example, regions were originally represented as scopes in code, but are now a set of points on the control-flow graph.

So, when you try to expose type-checked code as a data structure, you face a tradeoff between:

  • using the language of the surface syntax, which is stable but inexpressive, and might be awkward for tools to work with
  • using the language of a particular compiler intermediate representation, which is expressive and convenient to work with, but adds significant backwards-compatibility concerns.

This would also raise some implementation concerns. Not every implementation will represent the same syntax with the same intermediate representation, nor might it be useful to do so, as their internal data structures may lend themselves to entirely different intermediate structures. This would be especially true at lower levels. Comparing MIR and HIR, for example, HIR would, imo, be more generally useful, as MIR may not represent concepts the way another implementation's internals are designed to represent them.

You mention a few times in the doc that this would be perma-stable, and thus has to be backwards and forwards compatible. Why? Can't you just use normal semver and ask people to update their crates when they need new features or support for newer compiler versions?

There's also an assumption that this needs to be maintained by the rust project itself in-tree - is that the case? You can use rustc_private in any project, so this could start as an external project, which would let you experiment and get feedback much more quickly.


Basically my suggestion is "go fork clippy_utils and see how it goes" :stuck_out_tongue:

Note: Since my name was mentioned in the proposal by @jhpratt, I first want to make clear that the following is my personal opinion and nothing I discussed with them.

I agree that it should only be stable in the sense of semver, especially because developing this will not result in one "finished product", but rather in a bare-bones MVP of a stable API, to which we'd then add features and interface functions as requested/required.

Is there? If so, this should be clarified. My understanding was to use rustc_private and keep it as an external crate, and maybe later include it as a subtree in the Rust compiler once it's mature enough (and only maybe!). But maintenance should happen in its own project, or as the document put it:

Effectively, all work of tracking the internal API for all projects will be centralized [but not necessarily by rust-lang/rust]

Pretty much, but definitely don't do this and start from scratch. :smiley:


I believe the point is for the latter to "just work", though for the former, I would agree.

That would help with keeping it up to date, and also portable between implementations (though this is probably not an immediate concern, given that, beyond mrustc, none are complete), because each implementation would maintain its own copy, like proc_macro, that operates on its own internal data structures in the way it expects.

This would, as I mentioned, pose issues with other impls. What I would like to see is something like proc_macro, that exposes a stable api for talking whatever language the implementation understands (and the consumer of the crate wouldn't have to care as much about that language).

I don't see how this is possible. When the compiler internals change, you at least need a new minor version so the internals of the API crate still compile.

This would, as I mentioned, pose issues with other impls. What I would like to see is something like proc_macro, that exposes a stable api for talking whatever language the implementation understands (and the consumer of the crate wouldn't have to care as much about that language).

That's a much more ambitious goal, and I don't think it has a ton of benefit (other than for alternative implementations - and do they really need to support custom lints if they can still compile the code?). Building this on rustc_private seems much more doable.

Sure, but it's scoped by semver and the compiler packages it; a lint crate can then use the API from v1.x with a later rustc 1.y, because it exposes its own copy of the API.

I imagine it is, to get anything stable into the standard library. Technologically speaking, I can't see it being significantly more difficult than whatever the implementation's plugin API is, or the proc_macro crate, except for the amount exposed.

If there are limits on the ability to build rustc, or to build code with rustc (for example, supported target/host restrictions, or gratuitous use of unstable features in the compiler), I can see value in making both general and specialized custom lints available to non-rustc implementations. Whereas the latter could be implemented using the implementation's own functionality and internal APIs similar to rustc_private (and, in fact, may benefit from that), the former would basically require duplicating existing work for something like clippy (which, as I understand it, this proposal is trying to avoid). I'd argue that, while not all implementations may benefit from or choose to provide this feature, enabling them to support lints written against this API, without either rewriting themselves to internally follow the same structure as rustc or emulating such structures, is a good idea for a "stable" API.

This makes me wonder: what about implementing lints as a declarative 'query' API instead of literal AST iteration? For example, there might be an interface to register with the compiler: 'Call this method with a slice of all source locations where a variable of type X is declared' (using emphasis for parts that might be controlled by the crate registering the lint). At least superficially, this seems to combine the best aspects of an output file while integrating more closely with the compiler and avoiding serialization/deserialization overhead.

Without having given too much thought to the specifics, this might provide pretty decent compatibility as well. Additional syntactic elements (or combinations thereof) could be added as new result types, or integrated with declarative conditions if necessary.
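As a rough illustration of what registering such a query could feel like: all names below (`DeclarationsOfType`, `Compiler`, `run_query`) are invented, and no such interface exists. The sketch only shows the shape of the idea, with the compiler, not the lint, driving the iteration.

```rust
// Invented sketch of a declarative lint query. The lint describes *what*
// it wants; the compiler decides *how* to find it (and, in an IDE
// setting, how to keep the answer incrementally up to date).
struct DeclarationsOfType<'a> {
    ty: &'a str, // the "type X" part, controlled by the lint crate
}

// Stand-in for the compiler's view of the crate: (binding type, location).
struct Compiler {
    bindings: Vec<(String, usize)>,
}

impl Compiler {
    // The compiler answers the query via a callback; the lint never
    // iterates the AST itself and never sees internal data structures.
    fn run_query(&self, q: &DeclarationsOfType<'_>, mut on_result: impl FnMut(usize)) {
        for (ty, loc) in &self.bindings {
            if ty == q.ty {
                on_result(*loc);
            }
        }
    }
}

fn main() {
    let compiler = Compiler {
        bindings: vec![("String".into(), 10), ("i32".into(), 42), ("String".into(), 99)],
    };
    let mut hits = Vec::new();
    compiler.run_query(&DeclarationsOfType { ty: "String" }, |loc| hits.push(loc));
    println!("`String` declared at source locations {hits:?}");
}
```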


Agreed, this should definitely be the first step forward. And as I mentioned in the other thread, we already have an awesome tool for this: dylint. I think usage of it (or other similar tools, although that's the only one I know of as of today) ought to become a bit more widespread before investing more infrastructure effort into granting it an official / built-in status (like with proc macros).


Connor Horman (@InfernoDeity, I assume that's you, correct?) left a really interesting comment in the proposal about how this API could be adopted by compilers other than rustc. Would a useful goal of the API be to act as a facade similar to log or metrics?

Also, is the plan to treat this like a read-only database (kind of like what @HeroicKatora said earlier)? Or is the plan to make this more like a lens?

My personal choice would be that we come up with an API that is a facade that can be used as a lens. This would let us do some really clever things programming-wise that are out of scope for Rust as a language. For example, consider chained matrix multiplication via dynamic programming. If you know the sizes of your matrices at compile time (like you might with GLSL matrices), and you have enough information (like what the compiler might provide), you can choose the order in which you perform your matrix multiplications to reduce the amount of work you'll spend doing them. I'm not on the compiler team, so I don't know if this kind of work has already been done, but I can imagine a crate like nalgebra using the API to create a custom optimizer that looks for and applies these kinds of tweaks for the SMatrix type.
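For reference, the chained-multiplication optimization mentioned above is the textbook matrix-chain-ordering problem. Here is a sketch of the dynamic program (a standard algorithm, not compiler code) that such a crate-provided optimizer would run on the compile-time-known dimensions:

```rust
// Matrix-chain ordering: dims[i] x dims[i+1] is the shape of matrix i.
// Returns the minimal number of scalar multiplications over all
// parenthesizations; a real optimizer would also record the split
// points to emit the rewritten multiplication order.
fn matrix_chain_min_cost(dims: &[usize]) -> usize {
    let n = dims.len() - 1; // number of matrices in the chain
    let mut cost = vec![vec![0usize; n]; n]; // cost[i][j]: cheapest for i..=j
    for len in 2..=n {
        for i in 0..=n - len {
            let j = i + len - 1;
            cost[i][j] = (i..j)
                .map(|k| cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1])
                .min()
                .unwrap();
        }
    }
    cost[0][n - 1]
}

fn main() {
    // (10x30)(30x5)(5x60): grouping as (AB)C costs 1500 + 3000 = 4500
    // scalar multiplies, while A(BC) would cost 27000.
    assert_eq!(matrix_chain_min_cost(&[10, 30, 5, 60]), 4500);
}
```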

The main issue I see with this idea is security; with the possibility of rewriting the code that is being output comes the possibility of a bad actor inserting malicious code into the compiled program. I have no answer to this problem.

Yes, indeed, that is me.

What I was going for was something like proc_macro, that provides a standardized interface to compiler internals. It would be an abstraction layer over a direct interface to the compiler, which I guess is similar to what you're describing, though internal rather than external (compilers would likely provide their own copies of the API, vs. hooking into the API through black magic).

I wasn't aware of that. Rust-analyzer definitely has far more thorough requirements than linting.

Admittedly my use of "compiler API" is a bit off — it would be an equivalent of HIR/THIR api-wise, not anything like codegen. I just couldn't think of a better term.

Intuitively, I don't know how this would be possible given that some lints are non-local.

Theoretically, but it wouldn't make writing lints pleasant by any means. Even in this situation there would need to be a stable AST, which I don't believe is currently the case.

How would it be sidestepping stability? The serialization is only there because wasm interface types aren't a thing yet. I haven't yet had a chance to read over those two documents, but I'll take a look at some point.

That's why I'm trying to get some effort behind a stable external API while allowing the internal API to change in whatever way it needs to. I do not dispute that this is difficult; I stated as much in the post. The type-checking aspect would maintain the querying style that is currently used for lints, which I presume to be lazy.

"Stable" is meant in the same sense as Rust as a whole — 1.0 but hopefully never 2.0. Semver would still be followed when it comes to expanding functionality. @flip1995 and I are on the same page here. New compiler versions, in general, won't require an updated linter. The only exception to this would be when new syntax is introduced for reasons that should be quite obvious. The linter would be statically linked with the requisite internal dependencies.

If that was the implication you got, I'll clarify. The intent was to have this be out-of-tree from the start. I honestly hadn't even considered moving it in-tree, though I suppose it could be possible in the long term.

And vice versa. I did not run the text of the post by anybody; I just put the names down of people who had expressed some interest in general.

That's a good way to look at it.

Long story short: static linking. New syntax will require a new minor version, but any release that doesn't introduce new syntax won't require one in and of itself, as the linter doesn't rely on the versions the system has; it has its own copy statically linked.

They won't have to care! Due to statically linking with the internal libraries, the linter wouldn't care what implementation of Rust is actually being used. It's only the loading of the lints that is dynamic; the rest is still static.

I presume you have looked at the negatives of a tool like dylint that I mentioned?

I'm not sure how what you're saying is related to a stable API and/or a linter.

Overall, I need to limit the mentions of rust-analyzer to be at most tangential, as it's clear that their requirements go quite a bit beyond those of a linter. I believe this comment should clear up some of the ambiguities in my post, though I will of course update the text. I intend to read some of the materials linked by various users in this thread.


Well, the implementation would have to care, insofar as to provide the api. Linters ideally should not care, though, just like a proc-macro doesn't need to care, but the implementation still needs to provide the api.

Yeah, I'm referring to the external API provided. The linter could use any version it wanted, but would presumably be kept up to date. Were that not the case, newer syntax wouldn't be parsed properly.

It's noteworthy that Rust still allows itself some breaking changes and the reservation of new keywords with editions. These changes are planned to stay at the parser/lexer level. Still, I think we should keep possible changes in new editions in mind when designing an API.

I would really love to have such functionality, and there has been a suggestion to add syntax tree patterns to Clippy in rust-clippy#3875, which would work similarly from my understanding. Something like this could also make writing lints easier. However, I see them as additional layers that can be added by extra crates on top of a stable API, maybe in the form of macros that translate such queries.

The design of the stable API itself should probably use the AST representation or something similar, to focus on the stability part and not worry about any additional translation logic, as long as that can be added as an extra layer. :upside_down_face:
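Purely to illustrate the "extra layer" shape: a pattern macro could translate a readable query into a match over the stable representation. The miniature AST enum and the pattern syntax below are both made up, and far simpler than what rust-clippy#3875 proposes.

```rust
// Made-up miniature "stable AST" for the sketch.
#[derive(Debug)]
enum Expr {
    Lit(i64),
    Neg(Box<Expr>),
}

// The extra layer: a macro turning a readable pattern into a predicate
// over the stable AST. A real version would support far richer patterns
// and could live in a separate crate on top of the stable API.
macro_rules! expr_matches {
    ($e:expr, neg(lit)) => {
        matches!($e, Expr::Neg(inner) if matches!(&**inner, Expr::Lit(_)))
    };
}

fn main() {
    let e = Expr::Neg(Box::new(Expr::Lit(42)));
    assert!(expr_matches!(&e, neg(lit)));       // -(literal): matches
    assert!(!expr_matches!(&Expr::Lit(7), neg(lit))); // bare literal: doesn't
}
```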

The parser could be aware of which edition is being used. Keywords would just get parsed as something different in the internal AST, so an is_async() method could be provided without being a breaking change.
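A minimal sketch of that idea follows; the `Edition` and `Token` types and the `is_async` method are invented names, and rustc's real lexer is organized differently.

```rust
#[derive(Clone, Copy)]
enum Edition {
    E2015,
    E2018,
}

#[derive(Debug, PartialEq)]
enum Token {
    Ident(String), // in 2015, `async` is just an identifier
    KwAsync,       // in 2018+, it lexes as a keyword
}

// The parser knows the crate's edition and lexes accordingly.
fn lex_word(word: &str, edition: Edition) -> Token {
    match (word, edition) {
        ("async", Edition::E2018) => Token::KwAsync,
        _ => Token::Ident(word.to_string()),
    }
}

impl Token {
    /// The stable query a lint would call. Because lints never match on
    /// the raw token, adding `KwAsync` in a new edition isn't a breaking
    /// change for them.
    fn is_async(&self) -> bool {
        matches!(self, Token::KwAsync)
    }
}

fn main() {
    assert!(!lex_word("async", Edition::E2015).is_async());
    assert!(lex_word("async", Edition::E2018).is_async());
}
```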

This was discussed on Zulip. It's certainly an interesting idea, but not something I'd like to commit to this early.


You're right, it wouldn't be necessary for a linter. I got excited and thought about what a stable API could do, and outlined something that is far outside the current scope of discussion. My apologies for hijacking the topic (though I still stand by my contention that if the API were well-designed, it would be possible for crates to define optimization passes for the code they provide! :upside_down_face:)