How can we improve `proc_macro` without dependencies?

There are a lot of use cases where syn is a heavy library used to solve simple proc macro problems. Basic proc macros can be written without it, but it isn't very ergonomic at this time.

Is there anything we could add to the library to make this easier?

I don't foresee a future where syn isn't needed for more complex macros, but maybe there are small concepts we could pull in to simplify writing small macros.

A very loose thought is that we could create a common Error type (or trait) to make compiler errors more straightforward (panicking is common for small macros, that's not nice), and a Parse-like trait that keeps the same flow as syn within the constraints of what is currently available. For example:

// In theory a `#[proc_macro]` could allow functions that return a `Result`, but
// it isn't required.

/// Parsing a struct-like item here
#[proc_macro]
fn my_macro(ts: TokenStream) -> Result<TokenStream, Error> {
    let p = ts.into_parser();
    let name: Ident = p.parse()?;
    let contents: Group = p.parse_delimited(Delimiter::Brace)?;
    let p_inner = contents.stream().into_parser();

    assert!(p.is_empty());
    
    let mut v = Vec::new();
    while !p_inner.is_empty() {
        let key: Ident = p_inner.parse()?
        let _ = p_inner.parse_punct(':')?;
        let value: TokenTree = p_inner.parse()?;
        if !p_inner.is_empty() {
            let _ = p_inner.parse_punct(':')?;
        }
        v.push(key, value);
    }
    
    // ...
}

A notable deviation from syn is that we don't have strong types like syn::token::colon and punctuated to just .parse() anything without separate parse_punct/parse_delimited, but this seems acceptable (we probably don't want to add hundreds of types).

The goal is not to replace syn. It does seem reasonable that there should be a more painless dependency -free way to parse simple things such as attributes or basic structures.

Any thoughts?

(@dtolnay I am sure you may have some insight here)

5 Likes

I would love this!

One thing I consider essential for a proc macro parsing API, even a bare-bones one, is automatic retroactive reporting of tokens that your parser never looked at. Without this, it's extraordinarily difficult to write a parser that robustly rejects all the possible inputs that aren't intentionally supported.

For example with syn:

use proc_macro::TokenStream;
use syn::parse::{ParseStream, Result};
use syn::{parenthesized, Ident, Lit, Token};

#[proc_macro]
pub fn repro(input: TokenStream) -> TokenStream {
    syn::parse_macro_input!(input with parser);
    TokenStream::new()
}

fn parser(input: ParseStream) -> Result<()> {
    let _: Ident = input.parse()?;

    let paren;
    parenthesized!(paren in input);
    let _: Lit = paren.parse()?;
    let _: Token![+] = paren.parse()?;

    let _: Token![;] = input.parse()?;
    eprintln!("parsed semicolon!");
    Ok(())
}
parsed semicolon!
error: unexpected token
 --> src/main.rs:1:23
  |
1 | repro::repro! { f(1 + 1); }
  |                       ^

In your design sketch, I don't see an affordance for this checking to be doable by whatever type into_parser() is returning. In other words I don't see a line at which the "unexpected token" Error could pop up, sufficiently late after the parser knows the macro logic is done parsing the entirety of what it intended to parse.

Maybe unparsed token reporting would happen outside of the visible code in your sketch? I.e. whatever glue code in rustc responsible for making the call to my_macro would need to arrange to find out whether the macro implementation ever called into_parser(), and if so, then check after the macro returns Ok whether anything didn't get parsed?

2 Likes

Awesome :slight_smile:

Interesting point, what exactly is the mechanism for this in syn? It seems like maybe ParseStream::drop stores its unprocessed tokens, which parse_macro_input raises the error on.

If that is accurate, maybe it would be possible to just add a check at the bottom of #[proc_macro]'s expansion without even getting too much into the glue.

From an IDE perspective (especially with completions in mind), allowing to jump out with an error by default like this will make the macro experience for tools even worse than it already is. An IDE ideally wants proc-macros to expand to something reasonable at all times so that it can decently calculate completions for the current cursor location. So from an IDEs perspective, whatever parsing power we want to give to proc-macros by default (that is dependency free) should allow for some kind of recoverable parsing. If a proc-macro completely discards its output for error reporting this no longer works. You could argue that for attributes at least the IDE could just re-use the input as the output for its analysis, but that alone still loses a lot of meaning for the input.

As an example think of a token being re-used in some way in the expansion. Now this token will lose and gain analysis information on its usage whenever the input changes from valid to invalid and vice versa while typing which could result in flickering highlighting (with semantic highlighting), hover no longer showing expected things until the input is valid again etc. (This is strictly speaking about syntactic requirements the macro imposes, not rust syntax as r-a fixes up invalid rust syntax in attributes as is required).

Now for function like proc-macros re-using the input obviously does not work as the majority of them will not have proper rust syntax. Here the IDE won't be able to do anything without the proc-macro helping out by trying its best to keep expanding on invalid input.

As for derives, it is less severe as the only input where this somewhat makes sense for them is in inputs to derive helpers, but here as well the macro can help out by not bailing out immediately as completions could be offered in these positions as well!

So with that said, I'd rather we explore a default parser mechanism that allows for recovering on unexpected inputs and nudge the proc-macro authors to write "infallible" macros (and make this easier than it is today!).

As an aside for error reporting which would be relevant given my argument, there is a proposal for a diagnostics API which would allow to easily report errors without returning some kind of error type, panicking or having to emit a compile_error! invocation: https://github.com/rust-lang/rust/pull/83363

4 Likes

This would be great, thanks for the insight. Do you have any ideas what the API would look like with a parser?

My initial thought is that parser functions would still still return a Result, but with some combinators to emit a diagnostic then provide a default. Maybe something like

let name: Ident = p.parse()
    .or_diag_error("you need to provide a name")
    .emit_then(Ident::from("name_placeholder"));

Edit: this may actually be better if emit_then just takes a &str and then parses it via a TokenStream to the target type (Ident), automatically setting the Span from the input that you tried to parse. If it can't parse to the target, this would be a panic.

2 Likes

I recently started to make a little proc macro library for small proc macros that don't need syn, for a proc macro I want to make, and ran into problems with literals.

Say I want to make a proc macro that takes a single string literal as an argument. I can pretty easily get a TokenTree and verify that it's a TokenTree::Literal, and I can get its span, but that's it. The only option I have if I want to know more is to call to_string on the Literal and then re-parse the literal myself. For a string, that includes stripping away the quotes around it and re-applying escapes if it's not a raw string. If rust ever adds new escapes or makes other changes to string parsing (for example the stripping of multiple newlines after an escaped newline, which is explicitly marked as subject to change in the documentation), my string parser will need an update and maybe even switch between multiple implementations based on compiler or edition version to match the user's expectations.

I'd like access to the literal that the compiler has already parsed so we can avoid this complexity.

1 Like

That's right. More generally, not just parse_macro_input but every entry point into the parser (anything that instantiates a ParseStream from user-provided input, whether that is &str or proc_macro::TokenStream or proc_macro2::TokenStream) will propagate an error if any ParseStream during the parse got dropped without finishing consuming the tokens.

The implementation is a little more complicated than this because syn's parse streams support forking, with the intention that you can parse ahead in the fork before deciding how to parse the original stream. So there might be multiple parse streams encompassing the same tokens and not all of them will necessarily get to the end.

This could work but I'm not sure if it's the best approach. Consider all the other places a macro might get a TokenStream and want to parse it via into_parser, other than the main TokenStream that comes in as the macro argument.

For example if they use the Iterator API on the main TokenStream to arrive at a Group token, then group.stream().into_parser() on the Group. Would code at the bottom of #[proc_macro]'s expansion know about this parser?

Or if the macro loads some DSL from a file using proc_macro::tracked_path and uses TokenStream::from_str to parse the file contents as tokens, then wants to parse the tokens. Would it be able to do this in a way that correctly reports tokens that were not expected? Or would into_parser and everything else look like it works, but then silently not report unexpected tokens.

I don't know for sure that these use cases all must be served by proc_macro's parsing library, but syn's approach works in these cases, without requiring diligent manual application of p.error_if_not_empty()? in all macros (your code shows assert!(p.is_empty()) but that would be a worse error). Macro code (like in my previous comment) is written to handle the "happy path" only, and error reporting for all invalid input that does not match the "happy path" happens automatically without most macro authors ever thinking about it. It would be great if that property can be preserved even in a smaller API that isn't built around an elaborate syntax tree.

Welcome :slight_smile:

I definitely agree, that is an annoyance. We might even be able to add something like a non_exhaustive version of syn::Lit at this time since it is useful even outside of parsing, just with a From<Lit> for Literal and vice versa. (Well, maybe a different name like LitType so we don't have both a Literal and Lit).

The alternative is to add a bunch of methods to Literal to extract its type, but that's clunky.

1 Like

Good point, I suppose there could be a lot of code that the proc_macro expansion does not know about.

I think it definitely does fit, it's something easy to forget (I write a lot of macros and have never even considered the importance). Easier to design in now than later.

Maybe this could just tie in with the diagnostics API and emit a Diagnostic::error on unprocessed tokens. Rather than needing to be checked at the parser entry point.

1 Like

Thank you :grinning:

I would theoretically like the equivalent functionality to declarative macro matching, TokenStream::match_ty() -> TokenStream for $foo : ty and the like, that would probably handle a decent chunk of the work and doesn't seem to be any more of a versioning hazard than declarative macros already are. No strong feelings though...

We discussed this topic during our wg-macros open discussion.

There are two main points I believe are crucial to emphasize:

  1. Writing Effective Error Messages is Challenging. Consider the extensive code in rustc dedicated to generating messages. Often, it might be more beneficial to use a library that enhances beyond what the compiler alone can offer. An additional point is raised in a Zulip discussion about the implications of this feature on IDE/rust-analyzer integration.
  2. Stabilizing and Researching a Basic API for Error Message Creation. Several pull requests are currently implementing these APIs. It would be beneficial to experiment with them to gather further insights and feedback from their practical application.

While this is a brief overview, those interested in more details can view the full agenda here.


While this is my current area of interest, maybe it is worth talking about also if we can introduce some basic API for parsing rust syntax. This was raised by the Linux kernel people, but I am not sure that there is a good answer for it?

Currently, I do not know what is the big blocker for having this API, I am assuming that rustc uses experimental features and this will make the API unstable?

I'm on my way out the door, so this is not going to be particularly in depth. But one thing I believe would be useful to look at for attacking this problem is the SARIF format. It is the static analysis results interchange format, which is a much more general type than what we usually see with rust errors.

By that I mean it's error would be considered a concrete structure shared by all static analyzers. Where instead of containing just spans, it contains a rich format string (essentially markdown links and references) the target of which can describe both source and target spans relatedLocations and how they are related kinds (such as an error which references both the original source, and the sources produced by the macro expansion).

It is a pretty big departure probably from e.g. the Error trait we are all familiar with. But I believe that they way that they serialize error messages from what are arbitrary static analysis passes -- without resorting to strings, seems like it should also be worth a look for people also trying to improve the errors in macros too. Alas it is not exactly a small specification.

Rustc already has an error message json format. Whatever errors a proc macro produces has to fit into this. I took a quick look at the specification of SARIF and it seems to have a lot of flexibility that doesn't make any sense to be output by rustc, and can't be fit into rustc's error message format either. Note that rustc's error message json format contains both the rendered error message as rustc would print it (for cargo to print) as well as a detailed breakdown of the error for ide's to show inline.

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.