Let proc macros pass information to the compiler

Returning somewhat to the original question: wouldn’t it be possible to trace the syscalls that a proc-macro makes while it’s executing and infer the files that it needs to watch from that? Of course this won’t help you if the proc-macro picks files at random or does network IO, but anyone who writes a proc-macro like that is probably deliberately trying to fuck with you.

Why would less-than-100%-correct inference be used when you could just have Macros report their external dependencies accurately? For example, if a macro opens a directory and reads various files from it, there are at least three different possibilities:

  1. The Macro wants to process all files in that particular directory
  2. The Macro wants to process specific patterns of files in that directory
  3. The Macro wants to process specific files in that directory (and used directory traversal for no good reason)

With inference there is little chance of determining the correct option from the above three. But if the Macro reported to the compiler what it was interested in, then the compiler/IDE could be smart about when the Macro needs to be recompiled.
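
To make that concrete, here is a minimal sketch of what such a reporting API might look like. Everything in it is invented (there is no `report_dependency` in `proc_macro` today, and the `templates` directory is made up); it only illustrates how a macro could distinguish the three cases above so the compiler/IDE knows precisely which changes warrant re-running it:

```rust
// Hypothetical sketch only: `Dependency` and `report_dependency` stand in
// for whatever reporting mechanism the compiler might eventually expose.
use std::path::PathBuf;

#[allow(dead_code)]
enum Dependency {
    /// Case 1: re-run if anything in the directory is added, removed, or renamed.
    WholeDirectory(PathBuf),
    /// Case 2: re-run if any file matching the pattern changes.
    FilesMatching { dir: PathBuf, pattern: String },
    /// Case 3: re-run only if this specific file changes.
    File(PathBuf),
}

/// Stand-in for the call a macro would make so the compiler/IDE knows
/// exactly which changes should trigger re-expansion.
fn report_dependency(dep: Dependency) {
    let _ = dep; // a real implementation would hand this to the compiler
}

fn main() {
    // A macro that only cares about `*.html` templates can say exactly that:
    report_dependency(Dependency::FilesMatching {
        dir: PathBuf::from("templates"),
        pattern: "*.html".to_string(),
    });
}
```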

Well we could maybe have both. But why would it be less than 100% accurate? If a macro does this:

  1. Opens a directory.
  2. Iterates the entries.
  3. Opens some specific files.

Then we know exactly what we need to look for: changes to entries in the directory (files being added, removed, renamed, etc), or changes to the specific files that it opened. The syscalls tell you exactly what info the macro actually looked at.
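
As a concrete (made-up) example of that pattern, the sketch below uses only `std::fs`, so every piece of information the code actually looks at corresponds to a syscall a tracer could record: one directory listing plus one read per matching file.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads every `.csv` file in `dir`. A syscall tracer would see the
/// directory listing (watch for entries being added/removed/renamed)
/// plus an open/read for each matching file (watch those files' contents).
fn collect_inputs(dir: &Path) -> io::Result<Vec<String>> {
    let mut contents = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "csv") {
            contents.push(fs::read_to_string(&path)?);
        }
    }
    Ok(contents)
}

fn main() -> io::Result<()> {
    let inputs = collect_inputs(Path::new("data"))?;
    println!("read {} input files", inputs.len());
    Ok(())
}
```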

If the macro is poorly coded, such that it will:

  1. Recurse through directories
  2. Open an arbitrary file called lookatme.txt
  3. Only care about that file's contents

It could still report that it only cares about the contents/identity of that file for recompilation. Sure, it's a badly coded routine, since it should probably take a path argument or only look in one place, but it can report what it cares about better than the heuristic can.

The best of both worlds would be to use the heuristic for the macros that don't report what they care about (current ones and lazy future ones), but allow IDE- and incremental-friendly ones to specify what they care about.

The answer as to why we don’t just use the heuristics is the same reason we don’t use them for build scripts, though. They’re imperfect and there are always ways to accidentally escape the tracing.

Why would you assume that a poorly coded macro would correctly report what it depends on?

You wouldn't, but it would be clear that the poorly coded macro is in fact poorly coded, and it would be clear where to open an issue. With the heuristic method, it becomes somewhat a matter of interpretation whether the heuristic or the macro author is to blame. Also, if the requirement were that the macro must provide such a report or the RLS/Analyzer effectively ignores it, that would force macro authors to think about this consideration, which would make them more likely to report accurate information rather than relying on things accidentally working or not working.

That being said, I definitely see the allure of the heuristic method.

I see the allure of the original suggestion too: explicitness is generally better than implicit magic, and it's good to be clear about exactly whose responsibility it is to make the incremental compilation information accurate. However, I think that in practice the syscall-tracing method could be extremely accurate.

Breaking it down, there are two ways either of these methods could fail: (a) they fail to recompile when they should, or (b) they recompile spuriously.

For (a), I think the syscall-tracing method is guaranteed to work all the time. By comparison, if we make the macro author pass the information to the compiler, it's possible that they could fail to pass some relevant piece of information. Bugs like that are likely to be fairly common, but at least the user would know where to file a bug report.

For (b), it’s possible that a macro could iterate a directory for no reason, causing the syscall-tracing compiler to recompile the project every time that directory gets touched, but this is a pretty contrived scenario which is unlikely to occur very often in practice. It’s also possible that a crappy macro could pass spurious information to the compiler and cause the same thing. Either way it just means that compiler will end up doing more work than it needs to.

This is all assuming that macros are deterministic and only get information from things that the compiler can watch or control (eg. the filesystem, environment variables, etc). If this assumption fails then neither method can work.

Macros can do anything Rust code can do -- macros could for example take some URL and download a GraphQL schema from it and generate code from there (which would actually be a useful application). I'm not sure how that plays out in terms of syscalls.
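
Purely as a sketch of that kind of macro (the URL is made up, and it assumes a proc-macro crate that also depends on a blocking HTTP client such as `reqwest`), the relevant point is that the only I/O happens over the network, so nothing on the local filesystem changes when the remote schema does:

```rust
// Sketch only: a real macro would parse the schema and generate typed items
// rather than embedding it as a string constant.
use proc_macro::TokenStream;

#[proc_macro]
pub fn graphql_api(_input: TokenStream) -> TokenStream {
    // Network I/O at expansion time; no file watcher can observe when the
    // remote schema changes.
    let schema = reqwest::blocking::get("https://example.com/schema.graphql")
        .and_then(|resp| resp.text())
        .expect("failed to download schema");

    // Pretend "code generation": embed the downloaded schema as a constant.
    format!("pub const SCHEMA: &str = {:?};", schema)
        .parse()
        .unwrap()
}
```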

Buggy code is buggy code. Part of the responsibility of the macro author will be to make sure that recompiles happen -- I think the only alternative is that rustc proactively recompiles everything wherever a procedural macro is involved, and this will suck for compile times.

My point was that neither tracing filesystem access nor having the macro author emit instructions to the compiler would allow you to handle this situation. The compiler can't watch an external URL for changes.

With regards to the Code Analyzer knowing when to Re-Run/Re-Compile Macros and Macro invocations, I would think it is safe to say the following:

  1. Macro definitions can be recompiled whenever any of the code in the crate where the Macro is defined, or in its dependent crates, changes. If a macro definition is recompiled, then all invocations in the current project space need to be scheduled for re-compilation.
  2. Macros could, and probably should, report which resources they depend upon, including: files/directories within the project/workspace, environment variables, external URLs, and resource connections (DBs, etc. -- really just a variant of external URLs). Changes to the dependent files/directories or environment variables can cause auto-recompilation of the macro invocation. External URLs cannot reliably be monitored, but it is possible that tooling could develop for this at some point, or at least that UI affordances could be made to give the user visibility and a way to force a macro invocation to recompile (think a "Refresh Macros" button related to the external URL).
  3. Macro authors should avoid making things into Macros that would be better as build.rs scripts (just because something can be a macro doesn't mean it should be). Examples here would be: generating "classes" from a database schema, generating "classes" from SOAP WSDL, generating "classes" from HTML templates (basically, the theme here is "generating code from things that are external and relatively static for the lifetime of a project version").
  4. Macro users should consider (and the IDE should suggest and offer re-factoring tools to accomplish) factoring complex macro invocations into sub-crates that are built independently from their main logic crate(s).

Now, these are just thoughts of mine that probably require more refinement and nuance than I'm fully expressing here, but I think the main idea should be relatively uncontroversial. Thoughts?

The macro author could use some kind of compiler API that could handle this. For example, the macro gives the compiler some hash value, and the compiler asks the macro whether the hash is still valid.
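
A rough sketch of what that handshake could look like (all of the names here are invented; nothing like this exists in the compiler today): the macro hands back an opaque fingerprint along with its expansion, and on a later build the compiler calls back into the macro to ask whether that fingerprint is still valid, instead of unconditionally re-expanding.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Opaque value the macro returns alongside its expansion.
struct Fingerprint(u64);

/// Hook the compiler would call on a later build.
trait Revalidate {
    /// Return true if the expansion produced under `old` is still current.
    fn is_still_valid(&self, old: &Fingerprint) -> bool;
}

/// Example implementation for a macro that depends on a single file.
struct FileDep {
    path: PathBuf,
}

impl Revalidate for FileDep {
    fn is_still_valid(&self, old: &Fingerprint) -> bool {
        let mut h = DefaultHasher::new();
        std::fs::read(&self.path).unwrap_or_default().hash(&mut h);
        h.finish() == old.0
    }
}

fn main() {
    let dep = FileDep { path: PathBuf::from("schema.graphql") };
    let recorded = Fingerprint(0); // in reality, recorded at first expansion
    println!("needs re-expansion: {}", !dep.is_still_valid(&recorded));
}
```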

Hash value of what though? I'm not sure I follow.

That all sounds good, but I still can't think of anything that the compiler would be able to watch if told to, but couldn't figure out for itself that it needs to watch. Are we going to build support directly into the compiler for monitoring different kinds of databases? If not, then what can the macro do to tell the compiler when it needs to be recompiled? And if we are, then why not just trace the macro's communication with the database and figure it out ourselves?

I would say environment variables and whether the macro is interested in specific whole directories, patterns of files, or specific files.

I would say no. But, the Macro could let the "compiler/IDE" know that it has dependencies on a database connection, URL, or other external resource, and then the IDE can provide UI affordances to make the user aware of that and provide a way for the user to "Refresh" on demand.

For exactly the reason you've alluded to. Are we going to build into the compiler support for every kind of possible external resource that any given macro, now or in the future, might possibly depend upon?

I am still not fully convinced that a code analyzer needs to know exactly what macros do, as opposed to what they expand to. A code analyzer that needs to know what macros do has to solve the problem of, given Rust source code, understanding what it does. This problem is not solvable, so IMO the design of a code analyzer that needs to solve it would be flawed.

Therefore, a code analyzer has to work properly even if it does not know what a macro does, but it could provide better results if it did.

@gbutler proposes adding APIs to allow macros to express what they do, but we already have an API for that: it’s called Rust, and everything that Rust can do, macros can too. Adding extra APIs will never be enough to cover all that Rust can do, nor will it help with existing proc macros that do not use these APIs.

From the ideas I’ve heard here and on IRC, the best one is to just use Rust as the API: try to run proc macros in miri, and instrument what they do. Does the proc macro access files? Does it read or write them? Does it open network connections? Does it spawn threads? Etc. This will never be perfect, but since a code analyzer cannot rely on knowing what a proc macro does, that’s ok. The main advantage of this is that we can make it incrementally better by instrumenting more and more code. For example, we could instrument the std I/O features first, expand that to std::os::raw later, and maybe at some point even instrument libc::syscall.
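
To make the "instrument the std I/O features" step concrete, here is a rough std-only illustration of the kind of record such instrumentation could collect. The tracked wrapper is invented; in the miri-based approach the interpreter would do this recording itself rather than the macro opting in:

```rust
use std::cell::RefCell;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

thread_local! {
    // Paths the macro read during expansion, as an interpreter hook could
    // collect them.
    static ACCESSED: RefCell<Vec<PathBuf>> = RefCell::new(Vec::new());
}

/// Stand-in for an instrumented `fs::read_to_string`: identical behaviour,
/// plus a record of which path was accessed.
fn read_to_string_tracked(path: &Path) -> io::Result<String> {
    ACCESSED.with(|a| a.borrow_mut().push(path.to_path_buf()));
    fs::read_to_string(path)
}

fn main() {
    // Whether or not the read succeeds, the access itself is recorded.
    let _ = read_to_string_tracked(Path::new("config.toml"));
    ACCESSED.with(|a| println!("macro read: {:?}", a.borrow()));
}
```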

I would not characterize it that way. I would suggest that what I (and possibly others) are advocating is a mechanism/API that allows Macros to let the compiler/code analyzer/IDE know what inputs (beyond the body of the macro itself) they depend upon for purposes of recompilation. I don't think that is the same thing at all as "what they do".

Whether re-compilation is needed depends on what the macro does.

I disagree. It depends upon whether the dependencies/inputs have changed. It doesn't matter what the Macro does with those inputs; what matters is only that it depends on those inputs, and if those inputs change, then the macro needs to be recompiled.

AFAICT proc macros only need to be compiled once; what might happen is that they might need to be re-expanded. Whether a macro needs to be re-expanded only depends on the macro inputs, but given that a macro can take the whole system as input, I still don't see how adding APIs for expressing what its inputs are is better in any way than just using Rust for that.

When I said "recompiled", I meant "re-expanded". I was referring to macro calls, not macro definitions, and I would say "recompiling" or "re-expanding" a macro call is equivalent terminology.

Macros can take anything in the "World" as input, true. But a given particular Macro will not take the whole "World"; it will take specific, enumerable inputs that it itself can always accurately and unambiguously enumerate. The difference between the approaches is Implicitness/Magic vs. Explicitness/Precision.
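
As a sketch of what "specific, enumerable inputs" buys you (the types and the file/environment-variable names below are made up), a tool that is handed such a list can fingerprint it and re-expand the invocation only when the fingerprint changes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::env;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// The inputs a particular macro invocation declares, beyond its token input.
struct DeclaredInputs {
    files: Vec<PathBuf>,
    env_vars: Vec<String>,
}

/// Fingerprint the declared inputs; if this value is unchanged since the
/// last expansion, the invocation does not need to be re-expanded.
fn fingerprint(inputs: &DeclaredInputs) -> u64 {
    let mut h = DefaultHasher::new();
    for file in &inputs.files {
        fs::read(file).unwrap_or_default().hash(&mut h);
    }
    for var in &inputs.env_vars {
        env::var(var).unwrap_or_default().hash(&mut h);
    }
    h.finish()
}

fn main() {
    let inputs = DeclaredInputs {
        files: vec![PathBuf::from("queries.graphql")],
        env_vars: vec!["DATABASE_URL".to_string()],
    };
    println!("fingerprint: {:x}", fingerprint(&inputs));
}
```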

Only if there are APIs for every single input that it uses, and if the user does not make a mistake using them. If no API exists, then the user will have to wait until we add a new one, which might require an RFC, etc., and it will not work for older macros until they start using the new API. A miri-based approach would only require a PR, and it would retroactively work on all macros.

If the body of a macro changes, the inputs might change. A miri-based approach would either work or reject the macro. An explicit-API approach needs to handle three cases: API is used properly, API is used improperly (if the constraints are not updated), API is not used.

The miri approach would also work properly across toolchains (e.g. either the toolchain's miri can instrument / handle the macro or it cannot). Using explicit APIs would require the source code to handle different toolchains explicitly, e.g., if the API needed is only available in nightly or beta, but not stable.
