Completing rustfmt and the Rust style guidelines

I really don’t see any benefits to the AST approach, only downsides. Storing comments (and whitespace!) in the parser AST is annoying and adds extra maintenance burden. Formatting incomplete code is an important use case when integrated into an IDE. If you want to format macros, you are going to have to build a pseudo-AST anyway, at which point you might as well just use it for all the code in order to avoid duplication. And you really want a “logical line” abstraction in order to properly use the Knuth-esque line breaking algorithm that clang-format uses.

This is all from experience with the existing pretty printer, which tried to use the AST and never really performed properly. In particular, we had big problems with storing whitespace and comments in the AST: you want to preserve blank lines and such, and you want to place comments in the right place (above or to the right of the things that they’re commenting). The neat thing about clang-format is that because it formats and handles logical lines immediately, you don’t have to keep that information attached to the AST. This makes things much easier.

Again, I really think everyone in this space should familiarize themselves with clang-format. They’ve solved the hard problems and from experience with the existing Rust pretty printer there are fundamental design decisions that will bite you in the end if you don’t plan for them early on.

4 Likes

I’ve prototyped a pretty printer for rust in the last days. (it’s really an early prototype… it seemed like an interesting learning project)

First i tried the ast-approach, but it didn’t work out, mainly because of the reasons pswalton pointed out. After that i looked into his rustfmt and used a similar approach working on the tokens directly resp. on a custom token format more suitable for printing.

I think, working on tokens is the best approach. Sure, in the end it means writing our own parser, but this parser can be must simpler.

For this to work in the long term, we’re going to need some way of managing support for ‘stable’ and ‘unstable’ rust versions, as syntax changes in the language.

In theory backwards compatible, yes, but the reason we don’t have syntax extensions in 1.0 is because the AST isn’t nailed down.

I basically agree with this and I intend to follow this approach and see where it goes.

I don’t think I buy most of these arguments - you don’t need comments and whitespace in the AST, since we still have access to the source, which you can re-tokenise if you want tokens. So from that point of view, you get strictly more information by using the AST rather than a token stream. I don’t think the incremental/IDE argument is strong either - why would you need to format as you type? At worst I imagine that being able to format code which parses but doesn’t compile to be good enough. Formatting seems like the least important thing you need from an IDE with incomplete code.

The macro question is interesting. In that case you always have tokens, but I don’t think writing your own parser then is necessarily the best approach. I’d be keen to see if the rustc parser could be adapted. Example, if we expand a macro using dummy strings with the same size as the variable names, it should be parsable (obviously it wouldn’t compile).

The advantage of using the AST is that you simply have more information than the token stream, for example we might want to format the argument types in a Fn type differently from tuple types. It also makes it much easier to do transformative formatting - e.g., moving bounds to a where clause, or some of the tricks that Gofmt does, such as converting foo.len() == 0to foo.empty(). Finally, it makes it possible to use more type info from the compiler when we need it. For example, we could convert glob imports to list imports if the user desires. It should also mean more reuse between Rustfmt and future refactoring tools.

I don’t think that is much of an issue - unstable Rust should be a superset of stable Rust. I expect Rustfmt to evolve with the compiler. I don’t think we would back-port improvements in Rustfmt to work with older versions of Rust. So as long as Rustfmt stays up to date, then AST evolution isn’t a problem.

@pcwalton do you have any links to docs on Clang Format/libformat? The docs I found are mostly about using it rather than the implementation and were pretty sparse. The only design discussion I found was pretty old and advocated the AST approach - I’d obviously be keen to read about why they changed their minds.

I think careful thought should be given to whether rustfmt does give you any control over the output format, and realise the scope of that decision.

Go has declared one true format style, and gofmt enforces that. Every bit of code you encounter looks exactly the same as a result. They can get away with doing that because it’s a new language, with no pre-existing history of preferences and corporate coding styles.

The C/C++ worlds have been around for a long time, so clang-format was designed for a different use case - to format code to whatever style that user wants, because those desires are always going to vary.

With go, they’ve managed to completely eliminate argument over the One True Brace style or the correct amount of indentation - there is only one correct way to format go code. I have yet to see anyone complain about this decision, everyone seems very happy that all code looks like they expect it to (disclaimer: I’m not much of a gopher). The same opportunity exists now (and only now) with Rust.

There are other advantages than the ‘readability due to familiarity’ one. If there is only one way to invoke rustfmt, then you can happily run it before check-in and not worry about it changing any parts of the codebase apart from the ones you’ve touched, just because you’ve got a slightly different set of rustfmt rules.

2 Likes

Ideally, rustfmt would let you commit code in “canonical” style, but check it out in “personal” style. For example, 100-column lines are an Abomination Unto Nuggan, but if I never have to deal with them myself, that’s not so bad. So is “indent arguments to after the paren”, but anyway…

1 Like

Imo it’s extremely valuable to have a accepted normal form for code. gofmt may be the single best decision the go team has made.

[quote=“DanielKeep, post:30, topic:1685, full:true”] Ideally, rustfmt would let you commit code in “canonical” style, but check it out in “personal” style.[/quote]

This would not work for reading code on Github, for example.

On one hand, consistency makes for easier learning and readability — you’ll know what the code looks like. On the other hand, different people have different tastes, devices, etc. What looks great on my Full-HD-Notebook may look bad on your mid-range tablet or her mobile phone, for instance.

As Ralph Waldo Emerson said, “A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines.”

However, there is another reason to define and format to one style: It makes the formatter code easier to test, write and maintain. So unless it is really simple to add a configuration setting (for example column width should be fairly trivial, and having it configurable may help to weed out errors), I’d say add it later if at all.

Would it count if I complain up-front? There are a couple parts of the style that I would rather be optional, mostly indentation specific. If my only option is to not use the tool at all, rustfmt won’t have a chance to work on the styles that I fully agree with, and it would be a fully manual project.

Besides, I would’ve assumed rustfmt would end up with a per-project configuration file. Makes it both flexible and easy to keep a whole project in sync.

I’d suspect that at least the number of spaces (or tabs) used for indentation should easily be configurable.

That said, would you rather have a restricted rustfmt now or a fully configurable rustfmt later? I suspect the correct answer is ‘both’.

you're not really typing, you're copy pasting a block of code in your file from IRC or something and you want just that block of file to be formatted (your file doesn't even parse yet), so you select and hit the auto-format to format JUST that block of code

To evaluate the question lexer vs. ast I've made this:

At least it works well enough to format its own sourcecode and it can handle snippets and "wrong" code. There are some issues i know of (and surely a ton that i don't know of...), but i think it's viable and beneficial to just use the lexer output.

3 Likes

@madmalik nice! That looks pretty cool.

In terms of evaluating the lexeme strategy, I’d like to better understand the boundaries - it seems that you’ve demonstrated well that it can work, but I’d like to understand what the hard cases for the approach are, and how we can deal with them down the road. Perhaps you understand this already and can explain, or perhaps it requires more investigation.

I don’t myself know what the hard cases are. Some ideas of generally hard things for a formatting tool: macros (this is meant to be one of the strengths of this approach though), deeply nested expressions, expressions with lots of small arguments (e.g, x.f(a, b, c, d, e, f, g, h, i, j, k) or lots of long arguments, or a mix of long and short at different levels of nesting.

Some things which might be particularly difficult for this approach (this is pretty much guesswork at this stage): items with similar patterns of tokens with very different significance, e.g., a function declaration which takes a generic parameter which is a sugary function type; working in a checking mode, rather than a pretty-printing mode, or a mode where the tool only formats if the code violates some formatting rule (not to say that such a feature is necessary, but it might be desirable, and I want to know the limits of the approach).

Why do you find the approach beneficial? (Did you find new advantages in your experiment?)

Do you have an idea at what stage you’ll implement your own fuzzy parser? (I’m kind of going off what pcwalton said above, I’m not sure how big a leap this is from your approach). Do you know how much of an AST that will create? (any at all?)

I'd like to understand what the hard cases for the approach are, and how we can deal with them down the road. Perhaps you understand this already and can explain, or perhaps it requires more investigation.

I think that the hard cases are code transformation. For example, in match arms commas are needed on single statements that are not put in curly braces and are not in the last position, otherwise they are optional. It's really hard to erase this optional commas, because the decision depends on lots of context information and can't be made only by looking on the tokens before and after the comma.

To illustrate that, the decision rule "no commas after a closing curly brace when i'm in a match block" breaks down on cases like these:

match foo {
    Err(e) => 
        match e {
            ...
        },
    Ok(_) => {}
}

because the comma is needed because it comes after a single statement, while this single statement ends with a curly braced block. Maybe the style guide should say, that match arms whose content is a single statement or expression, but span over several lines should itself be wrapped in curly braces, but the transformation to wrap it in curly braces automagicly would be equally hard for a lexem based pretty printer.

I suspect all this would be easy when working with the ast.

Other hard cases are ambiguous tokens like *, <, >, -. These must be formatted differently in respect to the context they are in. With the lexeme based approach, one must rely on heuristics to decide these.

Generally i would say, the more sophisticated the transformation is we would like to perform and the more context is needed, the less suitable is the lexeme based approach.

On the other hand (that is the part I found beneficial) the simplicity is intriguing. In the end, it's a handful of rules where to inject whitespace and linebreaks and that's it. Also, it is not that hard to work with pieces that have no syntactic meaning like comments. And it will produce something okey-ish on snippets and incorrect code.

Do you have an idea at what stage you'll implement your own fuzzy parser?

I do track "context" with a stack. For example, when encountering an opening brace I put a new context on the stack and remove it when the closing brace is found. This is used for correct indentation and application of the correct whitespace-injection rules. This can be seen as some kind of rudimentary bottom-up parsing, even if no real ast is created.

@madmalik: Wow! Great job.

I’ve looked at some of the rustfmt code (which was styled by rustfmt):

 use syntax::parse::lexer::{
     StringReader,
     TokenAndSpan,
     Reader,
 };

[source]

and I think linked ain’t per style guide. IMO it should be:

use syntax::parse::lexer::{StringReader,TokenAndSpan, Reader};

or above spread in multiple lines.

I actually prefer the way his formatter has done it. Much more readable, especially once you have a lot of imports. I hate the current style that wraps and indents all the way to match the first {.

Additional benefit is that diffs are better.

3 Likes