Breakage of fragile proc macros in nightly-2020-07-03

We depend on a somewhat fragile proc macro at my work which does string matching on each line of the TokenStream and performs some operations on what it finds. This macro was broken as of nightly-2020-07-03, since the input TokenStream no longer has line breaks in it (rollup:

While I didn't write the macro in question, I think this string-matching approach was taken partly due to the lack of resources on writing proc macros from scratch with zero dependencies. Here is a Twitter thread I found which expresses some of my own thoughts.

Here's a link to nanoserde, which may serve as a good reference for anyone else wanting to correctly write a proc macro with zero dependencies: nanoserde/derive/src/ Also mentioned by Manishearth in that PR is their crate, absolution, which is lighter-weight than syn.

Although it's 100% fair to say that my case is a misuse of the TokenStream APIs, I wanted to raise the potential for breakage in other crates and maybe kickstart some discussion about the things mentioned in that Twitter thread -- notably the lack of documentation for using plain TokenTree and TokenStream.

FWIW, a lot of people don't realize you can use syn without the whole AST and parser. If you turn off all the default features, you're left with what's essentially just a framework for parsing TokenStream.
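For example, a stripped-down dependency line might look like the following (a sketch; the feature names are from syn 1.x and may differ in other versions, so check the crate docs for yours):

```toml
# Sketch: syn reduced to its parsing framework.
# Feature names per syn 1.x; verify against your version's docs.
[dependencies]
syn = { version = "1", default-features = false, features = ["parsing", "proc-macro"] }
```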

(I'm actively working on a parser generator that uses syn as a backend, because it's cool, I guess?)

On the other hand, I've personally had great success using watt for proc macros. It's a bit more work to set up, but it works very well, letting you, the developer, use all the nice things without impacting downstream (build) dependency load.


Out of curiosity, where does this requirement come from?


Some of the people I work with have a strong C/C++ background and are against bringing in external dependencies for multiple reasons: risk of supply-chain attacks, general dependency on someone else's code, dependency bloat, and increased compile times. IMO it's a little ridiculous but I understand the concern.

Proc macros in particular though are an area where it's hard to find intermediate docs without the use of an external library.


In general the compiler is not able to make guarantees about the exact formatting of a TokenStream's to_string output, except that it can be parsed back into a (possibly different) TokenStream. Parsing the string in a way that expects exact whitespace/newlines in particular places is not a good plan.

Further, for attribute macros and derive macros the compiler is not able to make an exact guarantee about the non-whitespace tokens either. The presence or absence of trailing commas and similar syntax minutiae can change over time.
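To illustrate the fragility (a std-only sketch using plain strings, not the actual TokenStream API): the same tokens can legitimately stringify with different whitespace and punctuation, which silently breaks any line-oriented matcher while a token-based parser is unaffected:

```rust
// Sketch: two renderings a compiler could legitimately produce for the
// same token stream. A token-based parser sees identical tokens; a
// line-based string matcher does not.
fn has_field_line(source: &str, needle: &str) -> bool {
    source.lines().any(|line| line.trim() == needle)
}

fn main() {
    let with_newlines = "struct Foo {\n    x: u32,\n}";
    let single_line = "struct Foo { x : u32 }"; // spacing + trailing comma differ

    // The naive matcher only works for one of the two renderings:
    assert!(has_field_line(with_newlines, "x: u32,"));
    assert!(!has_field_line(single_line, "x: u32,"));
    println!("line matching is rendering-dependent");
}
```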


At least for me, I've partially rewritten the macros in the time crate to function without using syn or quote, due to the added compilation time. Unfortunately it's far from trivial, given my limited experience with proc macros.


One thing I noticed for myself is that the clarity of the derive macro I've been working on lately has improved quite a bit as a result of using syn and especially the quote! macro. The reason is that quote functions as a templating macro, so the code looks a lot more like the target code than manually manipulating a syn syntax tree (which I was doing) or, worse, manually reparsing the input tokens.

A useful side effect of the above was that I was able to refactor a big piece of the code to no longer have quadratic runtime in the input (specifically, when used on enums the macro would be quadratic in the number of enum variants defined by that enum). Now the runtime is linear, providing a nice speedup.
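As an illustration of the kind of refactor involved (a hypothetical sketch, not the actual macro from this post): rebuilding the accumulated output string for every enum variant copies everything generated so far and is quadratic, while appending each variant's output to one growing buffer is linear:

```rust
// Hypothetical sketch: generating one match arm per enum variant.
// Names and generated code are illustrative only.

// Quadratic: `format!` re-copies the entire accumulated output each step.
fn arms_quadratic(variants: &[&str]) -> String {
    let mut out = String::new();
    for v in variants {
        out = format!("{}{} => stringify!({}),\n", out, v, v);
    }
    out
}

// Linear: append each arm to a single growing buffer.
fn arms_linear(variants: &[&str]) -> String {
    let mut out = String::new();
    for v in variants {
        out.push_str(v);
        out.push_str(" => stringify!(");
        out.push_str(v);
        out.push_str("),\n");
    }
    out
}

fn main() {
    // Both produce identical output; only the cost differs.
    let variants = ["A", "B", "C"];
    assert_eq!(arms_quadratic(&variants), arms_linear(&variants));
}
```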

And all that comes at a compile time cost I'm not even sure is noticeable (but the crates I'm using it in mostly have somewhat longer compile times already, which is understandable when you consider that each is on the order of 15KSLOC of macro-heavy code).

Additionally, in the context of breakage, I haven't seen any so far in my derive macro. But then again, personally I'd never make a correctness tradeoff to gain a little (compile time) speed unless there really was no other way and it was really necessary for some reason. Which is almost never for me.


Agreed, syn/quote! are extremely useful -- especially when you want to use spans for better errors. I basically modeled my own proc macro after serde and have found maintenance to be a breeze since.

The breakage we encountered (note: not in the macro I referenced above) was mostly because the existing APIs without external crates are hard to use, and the cheaper solution was to parse the tokens as strings. That definitely isn't how you want to parse tokens, of all things, but it was the cheapest and most straightforward approach.


I just noticed you work for Microsoft. Now the quoted text makes perfect sense to me :)

Those engineers should be aware that Rust is both a lot like C++ and nothing like it at all (non-optional safety and built-in ecosystem tooling are each a total game changer). By that I mean it would be useful for them to "not knock it before they try it", so to speak.

Now I know you likely already know all of this, but if you ever have another coffee chat with one of the C++ people, it might help ease their apprehension of Rust and let them drop some of those C++ assumptions (e.g. the fewer dependencies a project uses the better, and therefore it's worthwhile to put effort into optimizing for that) as dogmatic, general truths. In the process, it might make your own life a little easier by removing restrictions that seem reasonable for C++ but are just silly for Rust.

Probably not the best way to begin this post, but I'll start by saying my mind feels a little scrambled today, so apologies for being a little over-verbose.

I think the problems I laid out above are very valid concerns.

It probably wasn't even worth mentioning that the folks are primarily C/C++ developers. It was more to convey the attitude that, for a sufficiently large project and a small enough problem, adding extra dependencies becomes a tradeoff worth evaluating. I know this attitude is already common among Rust experts. burntsushi, for instance, tries to avoid adding dependencies to ripgrep that increase clean compile time above a target.

There are many factors that can add compile time, such as general crate bloat, scope beyond your problem, and not having perf in mind. But for a problem as small (and common) as:

  • Parse a struct
  • Derive a trait with work done based off of each field and its types

It seems kind of ridiculous that a built-in language feature for such a task has no documentation for how to do this with built-in data types (even if you have to be a bit more strict with formatting and such). You're hard-pressed to even find community documentation on it. I haven't talked to the dev in question, but I can certainly see that the easiest way to get some wins on clean compile time was just to convert the TokenStream to a string and do matching. As I've said, I know this is not a proper usage of the APIs, but I think the rationale behind it is relatable.

Referencing nanoserde this week was the first time I actually saw how you can cleanly go from a TokenStream to identifying a data type, fields, etc. It'd be amazing to have a built-in method for the following process:

  1. Asserting the container type for a TokenStream (e.g. that the container is a struct)
  2. Iterating struct fields by name and type
  3. Building your own string that you can convert to a TokenStream and output

I understand that there are many edge-case scenarios here, such as unnamed fields, anonymous/inline types, full type paths, parsing attributes, doc comments, lifetimes, and enums, and why all of this complicates API structure, leading to no built-in/stable APIs. With that said, reference docs for how to approach what seems like such a common scenario without dependencies are seriously lacking, and after reading the nanoserde source, it doesn't even seem to be very hard.
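To make the process concrete, here's a std-only sketch of checking the container and iterating fields. It uses a tiny stand-in token type so the example runs outside a proc-macro crate; in a real derive macro you'd do the same walk over proc_macro::TokenTree. All names here are illustrative:

```rust
// Std-only sketch; the stand-in `Token` mimics the shape of TokenTree.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Punct(char),
    Group(Vec<Token>), // stands in for a braced TokenTree::Group
}

fn ident(s: &str) -> Token {
    Token::Ident(s.to_string())
}

/// Check that the container is a struct and collect (field, type) pairs.
fn struct_fields(tokens: &[Token]) -> Option<Vec<(String, String)>> {
    let mut it = tokens.iter();
    match it.next()? {
        Token::Ident(kw) if kw == "struct" => {}
        _ => return None, // not a struct
    }
    let _name = it.next()?; // struct name
    let body = match it.next()? {
        Token::Group(body) => body,
        _ => return None,
    };
    // Fields are `name : Type` sequences with optional trailing commas.
    let mut fields = Vec::new();
    let mut it = body.iter().peekable();
    while let Some(tok) = it.next() {
        let field = match tok {
            Token::Ident(f) => f.clone(),
            _ => return None,
        };
        match (it.next(), it.next()) {
            (Some(Token::Punct(':')), Some(Token::Ident(ty))) => {
                fields.push((field, ty.clone()));
            }
            _ => return None,
        }
        if let Some(Token::Punct(',')) = it.peek() {
            it.next(); // consume trailing comma
        }
    }
    Some(fields)
}

fn main() {
    // Tokens for: struct Point { x: f64, y: f64, }
    let tokens = vec![
        ident("struct"),
        ident("Point"),
        Token::Group(vec![
            ident("x"), Token::Punct(':'), ident("f64"), Token::Punct(','),
            ident("y"), Token::Punct(':'), ident("f64"), Token::Punct(','),
        ]),
    ];
    let fields = struct_fields(&tokens).unwrap();
    assert_eq!(fields, vec![
        ("x".to_string(), "f64".to_string()),
        ("y".to_string(), "f64".to_string()),
    ]);
}
```

From the (name, type) pairs you'd then build your output source text and hand it back as a TokenStream.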

Sorry for the book. Just getting some thoughts down. Since I'm complaining so much I'll probably take it upon myself to add some info to the Rust book for the exact scenario mentioned above.


These are valid concerns, but:

Supply chain attacks are a part of business life. It isn't any different from the physical world, where you have the additional difficulty of supply chains not being able to deliver required quantities. The mitigation is the same for the physical world as it is for software, though: unless you plan to vertically integrate all the way, it's a game of risk management by its very nature, regardless of which programming language you use.

General dependency on someone else's code: see paragraph above. I will also add that NIH syndrome is generally not considered a good thing, niche software with extreme requirements notwithstanding.

Dependency bloat is no problem for me personally, but I can easily see how it can be in constrained environments, or when scaling up to the size of an AAA game, or significant subcomponents of e.g. Windows or a browser. There are things that can be done about it (some of which you have already mentioned), and it would be great if that could be a more automated process.

Increased compile times: I'm fully in agreement with you that it would be more than a little great if these went down significantly. But all that type checking and LLVM-based code generation is expensive. I'm also not so sure that, without domain-specific accelerators, these times even can go down significantly (data points: Rust, and especially Haskell, with a long history of go-have-yourself-a-swordfight-length compile times), and that's assuming there is unexploited potential there, which may or may not be the case. Given that, plus the tendency of programming languages to grow in the direction of dependent types (we're far from there for mainstream languages, but in 10 years it could well end up there), I wouldn't expect compile times to go down by any significant amount any time soon, unfortunately. Whether or not the additional safety guarantees are worth the extra time spent compiling will depend on the circumstances, I expect.


Wouldn't that be an argument in favor of avoiding dependencies?

Indeed, you could view it that way. But it should be weighed against all the effort it takes to develop those dependencies (or the parts of their functionality that are of interest), debug them, and otherwise maintain them.

As far as I know there are no quantitative tools to help one make that decision, so it effectively comes down to which properties (compile times, adherence to established engineering principles such as avoiding NIH, effort in developing your own private version of would-be dependencies, etc.) one needs/desires most, I guess.

Personally I sort of deal with it by throwing hardware at the problem (16 Threadripper cores put quite a dent in compile times, as dependencies can often be compiled in parallel), but I understand not everyone is in a position to do that. I also have the advantage of working with 10-20KSLOC codebases rather than >100KSLOC, so I don't get to experience the more extreme runtime blowups of superlinear algorithms in the compilation process, against which a multicore processor would be useless. Compile times of < 1 minute for a codebase of tens of KSLOC are reasonable IMO, and even more justified considering that it is quite macro-heavy code (i.e. a hypothetical C equivalent could easily be some multiple of that).

Rather than debating whether dependencies are useful/dangerous, can we address the main point, which is that there's very little documentation about proc macros? Is anyone interested in writing that documentation? How different is TokenStream from syn::TokenStream -- does it make sense to reuse some of the same docs for both?

It's an apples-to-oranges comparison. syn does the parsing and heavy lifting for you, but it can be overkill for very small projects. The problem instead is that the std TokenStream has no examples of how you should set up your parsing state machine.
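As a std-only illustration of what "setting up your parsing state machine" means (the grammar and names here are made up for the example; with the real API you'd drive the same states from an iterator over proc_macro::TokenTree):

```rust
// Sketch: an explicit state machine over a flat token sequence,
// parsing `name = value` pairs. Illustrative only.
#[derive(Debug, PartialEq)]
enum State {
    ExpectName,
    ExpectEq(String),    // holds the name seen so far
    ExpectValue(String), // holds the name, awaiting its value
}

fn parse_pairs(tokens: &[&str]) -> Result<Vec<(String, String)>, String> {
    let mut state = State::ExpectName;
    let mut pairs = Vec::new();
    for &tok in tokens {
        state = match state {
            State::ExpectName => State::ExpectEq(tok.to_string()),
            State::ExpectEq(name) => {
                if tok != "=" {
                    return Err(format!("expected `=` after `{}`", name));
                }
                State::ExpectValue(name)
            }
            State::ExpectValue(name) => {
                pairs.push((name, tok.to_string()));
                State::ExpectName
            }
        };
    }
    match state {
        State::ExpectName => Ok(pairs),
        _ => Err("unexpected end of input".to_string()),
    }
}

fn main() {
    let pairs = parse_pairs(&["a", "=", "1", "b", "=", "2"]).unwrap();
    assert_eq!(pairs, vec![
        ("a".to_string(), "1".to_string()),
        ("b".to_string(), "2".to_string()),
    ]);
}
```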

I posted a small reference project on reddit about a month ago to fill that void:

Still need to update the Rust book, though.