Completing rustfmt and the Rust style guidelines

Hey friends.

There’s a feeling in the air, and I hope you feel it too. rustfmt. We all want it.

Several people have expressed interest in writing rustfmt recently, and I think they can make this happen if we build a little momentum, so I’m opening this thread as a place for those interested to colloborate.

The essential goal is to implement the style guidelines. All we need is a command line tool that takes a source file and produces a source file.

Part of this effort will also entail completing the style guidelines so that they and the tool agree.

The big question is whether to base it off tokens, the ast or a hybrid. There’s no consensus on this but personally, I’m inclined to start from the token tree and add in more semantic information when it becomes necessary.

@ahmedcharles has revived @pcwalton’s rustfmt prototype as a starting
point.

Some potential next steps and discussion points:

  • Is pcwalton’s rustfmt the right base to work off of?
  • What is the state of pcwalton’s rustfmt? How close is it, what needs to be done next?
  • What are the hard known problems and how are we going to solve them?
  • Who is going to step up and lead us into the bright rustfmt future? How can we devide up the work?
  • Style guide can be worked on in parallel.

Let’s make decisions quickly and move forward!

cc @nrc

8 Likes

Just for clarification: A rustfmt tool will be an all-or-nothing deal if I understand the plans correctly. Is that right? As in, the inability to opt-out of certain styles is regarded as an explicit feature?

For example, would an option to not reformat type and function parameter (Edit: fixed from “argument”) lists be accepted, or is the only other choice to not use rustfmt at all?

Edit: “arguments” -> “parameters”

My goal would be to make the same fundamental decisions that clang-format did and to enable the same scenarios that it does for C++, only in Rust.

That would imply that the answer to your specific question is both yes and no:

Yes, because there would hopefully be options to limit reformatting to a set of lines. No, because if you do reformat a line with a function argument list, you will end up with either all on the same line, all on the next line, each on a single line or bin-packing. There isn’t an option that would leave the list in it’s original state, by design.

Hello all, Here are my own opinions to the questions.

  1. If we decide to work based on token tree, current rustfmt will be a good start. However I personally think ast is better because we will need semantic information. For example, “*” could be either “multiply” or “dereference” and “|” could be either "binary or “closure”. Also, both clang-format and gofmt are ast based.

  2. Main challenges

    • Correct line break. // Current rustfmt doesn’t insert line break when one line is too long.
    • Correct indentation.
    • Correct whitespace. // I saw related code but seems just add whitespace after operators. For example, whitespace after comma, colon. In one line blocks/structure expressions, we need one whitespace after opening brace and before closing brace.

Sorry, I’ve messed that up. I meant parameter declarations, not in-expression arguments. This should be doable even with an option to reformat only expressions/statements inside function bodies, and not the declarative surroundings themselves.

So…

To start off, I think there are two architectural choices to make - tokens vs AST and pretty printing vs heuristic adjustment. I think they are orthogonal.

Tokens vs AST is not really the right way to frame the decision. In both cases we work off an AST for most situations and fallback to tokens or source text in some instances. The ‘tokens’ approach is really building our own AST (as opposed to using the libsyntax AST). The advantage of the former is that we can optimise for the job in hand, rather than using a more generic AST. However, the disadvantage is that we don’t share code with the compiler and thus keep things inline together. How this manifests is that the ‘tokens’ approach is able to cope with code which does not parse - e.g., snippets or non-compiling code. It also makes handling macros easier. The ‘AST’ approach makes it easier to leverage more information from the compiler to make decisions.

My personal feeling is towards the AST approach. I think formatting incomplete code is a distraction and should be a non-goal (someone change my mind with a good use case :slight_smile: ). I think macros can be handled just fine using the AST approach - we just need to handle them pre-expansion. Furthermore, I see the end game for Rustfmt involving moving a bunch of the style lints out of the compiler and into Rustfmt. That would require more compiler info, and (I think) requires we have knowledge of the AST.

I believe most existing tools take the pretty printing approach. I prefer the heuristic adjustment approach. Importantly, the pretty printing approach has to be perfect before it is useful. Whereas, an heuristic approach is useful straight away. It also allows us to use the layout of the source more easily as input. Pretty printing can never perfectly insert line breaks, so if someone has already formatted their code in a style-rules-conformant way, re-formatting that according to pretty printing rules seems like a bad move.

Orthogonal to these questions is how customisable the tool is. I believe it should be customisable (like Clang Format) rather than strict like Gofmt. Although the latter saves arguments, it seems ill-adjusted to the real world - people might want their Rust formatted like existing code or might have strong personal preferences. They will not use this tool if we go for the strict version. I would like the tool to be useful for the most people so I err on the side of pragmatism. However, it would be easy to add configurability later, so we don’t need to worry about it for now (as an aside, I think it would be interesting research to be able to formally describe style rules for such a tool, rather than using ad hoc constraints - e.g., style by example (analogous to macro by example). Styling is really the counter to parsing - it is a set of rules which map from parsed code to tokenised code, whereas parsing is the other way round, so it seems there should be some counter of a formal grammar for specifying styling, anyway, I digress).

I would like to put together some test cases of awkward code and how it would be formatted in order to get a better idea of what information we need, these would be useful in a test suite too. I’m not exactly sure what would make something awkward, but lots of rightward drift would give most pretty printing algorithms trouble.

I have a prototype using the AST and a heuristic adjustment approach. The key element is representing the source text using a modified rope which keeps track or original source positions as well as current ones, that makes it easy to build on previous adjustments, whilst using the compiler’s spans for input. Unfortunately it is a mess. I haven’t had time to work on it for a while. I’ll try and tidy it up and put it online very soon…

4 Likes

Agree that AST makes life easier. nrc: I am not sure about the point of what is heuristic adjustment approach. Could you tell me more about it?

BTW, I suggest to have a design document (e.g. Google Doc) for the tool so that we have a clear roadmap and track the progress. Could anyone create one and share it, or just use my existing doc?

Here is the link of my previous design doc. I can grant edit privilege to people who are interested in editing. https://docs.google.com/document/d/1Nes-4I85PZLlW5R9BwSanzRufzDiEUug1sDf1wMlb4Y/pub

The goal for me is to basically make a clang-format clone for Rust, which suggests using the same basic algorithm and overall design decisions since I basically agree with them all. And since I’m familiar with Clang/LLVM development, it’s easy to just look at the code and use it as a guideline.

As far as scope goes, I’m not interested in doing lints or parsing or anything more than adjusting whitespace between tokens. For example, clang has analogous tools for things other than formatting, like clang-modernize, etc.

As far as pretty-printing vs heuristic adjustment. I don’t have a strict opinion, however I like that clang-format can be limited to only working on certain offsets/lines/etc. This allows the user to have code that is specifically formatted remain the same through use of a tool like git-clang-format, for example.

For the moment that is ok, but think about a future Rust IDE, which should be able to format erroneous or incomplete code. Creating a 'lenient' AST parser that could also be reused for syntax highlighting etc. would bring its own benefits and could pay off handsomely in the future.

I think that in that case, IDE makers generally prefer to make their own separate parser, optimized for partial/incremental parsing, no?

IDEs need to deal with things like unclosed quotes and brackets. In theory by knowing the history of recent edits they can do this more smartly than a tool without this information, so not trying to deal with this case makes sense.

On the other hand, formatting snippets (e.g. function bodies) could be useful.

Working in an IDE is such a long term goal that it would be distracting now. We are so far off having incremental compilation working that incremental formatting is crazy talk. Hopefully one day we will get ‘incremental’ parsing for IDE support and then we can use that to power formatting too.

3 Likes

So my prototype implementation is here: https://github.com/nrc/rustfmt

Where is works, it works well in that it copes with varying input and produces nice output, but the style guidelines it enforces are very limited: it (incompletely) formats formal arguments to functions, it formats string literals, and it formats use lists.

I think though, at this stage, if I were to continue with the project, I would start again rather than trying to extend this.

Thinking a bit about requirements, I think the following are different changes from the gofmt model:

  • I would like to be able to identify style violations as well as correct them. This means analysing the source code, rather than just re-printing it.
  • I would like to use the source code as heuristic input into the pretty printing algorithm. Even while sticking to the style guidelines, there are usually multiple ways to correctly format code, so if code is correctly formatted, I do not believe it should be re-formatted to some ‘optimum’ style. Likewise, even if we are going to change code, then I think the programmer’s existing code can give good hints for how to do it.
  • I would like formatting to be configurable.
  • I would like development to be incremental - the starting point should be the identity translation and as we add code, we should make the output better. The starting point should not be very different, but badly formatted code (like our current pretty printer) since then the tool has to be perfect before it is useful, and that will take a long time (because software).

I still think these are good requirements. I wonder if the first makes things unnecessarily difficult though. We could easily produce formatted output and diff against the original to show style violations. However, that would not give explanations of which guideline was violated, and would not be very readable.

The reason I propose starting over rather than extending that prototype is that I think the approach of changing text is complex and unnecessary. We could satisfy all these goals by having a more functional approach, rather than modifying text in place.

My second worry, is that although the approach of using ad-hoc code for each expression and using the existing source as input gives (IMO) pretty good output, it is not nice code - there is lots of it and there is not much reuse. This contrasts strongly with traditional pretty printing algorithms where code is very generic (and straightforward for each expression), but the output is less optimal.

I’m not sure how to weigh these two factors against each other.

1 Like

Since the last post I changed my approach to something a bit more pretty printing-like - I no longer use the Rope to modify source, but just print it out. It is still driven from the AST rather than token streams. Where I don't yet format, I use existing source. This makes it an incremental approach. It needs some work till it can be used on any Rust code though (for example, we need to properly adjust indent for all items, I currently only do it for blocks and impls).

I am fairly confident that this is a good start towards a working tool. Although the question of tokens vs AST has not been completely settled in my mind.

If anyone is interested in helping I'd really appreciate contributions. The easiest (and probably most interesting) way to get stuck in is to implement formatting for some expression or item. There are a bunch of issues filed on the repo too. (Good regression testing is probably the highest priority 'big ticket' item).

For an idea of what it can do already:

before:

use std::cell::*;
use std::{any, ascii, self, borrow, boxed, char, borrow, boxed, char, borrow, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed, char};

// sfdgfffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffff

                                                         fn foo(a: isize,
     b: u32, /* blah blah */
         c: f64) {

}

fn main() {
        let rc = Cell::new(42us,42us, Cell::new(42us, remaining_widthremaining_widthremaining_widthremaining_width), 42us);
    let rc = RefCell::new(42us,remaining_width,           remaining_width);  // a comment
      let x = "Hello!!!!!!!!! abcd  abcd abcd abcd abcd abcd\n abcd abcd abcd abcd abcd abcd abcd abcd abcd \
                   abcd  abcd abcd abcd abcd abcd abcd abcd abcd abcd \
                    abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd";  }

after:

use std::cell::*;
use std::{self, any, ascii, borrow, boxed, char, borrow, boxed, char, borrow};
use std::{borrow, boxed, char, borrow, boxed, char, borrow, boxed, char, borrow};
use std::{boxed, char, borrow, boxed, char, borrow, boxed, char, borrow, boxed};
use std::{char, borrow, boxed, char};

// sfdgfffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffff

                                                         fn foo(a: isize,
                                                                b: u32,
                                                                c: f64) {

}

fn main() {
    let rc = Cell::new(42us,
                       42us,
                       Cell::new(42us,
                                 remaining_widthremaining_widthremaining_widthremaining_width),
                       42us);
    let rc = RefCell::new(42us, remaining_width, remaining_width);  // a comment
    let x = "Hello!!!!!!!!! abcd  abcd abcd abcd abcd abcd\n abcd abcd abcd abcd abcd abcd abcd \
             abcd abcd abcd  abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd \
             abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd abcd \
             abcd";  
}

Note the imports are tidied, the string literal is adjusted, the nested function calls are handled, and the formal arguments in the function decl are aligned (however, the rest of the function decl is not yet formatted.

You will need to copy and paste the before and after to see that it has been tidied properly, the limited width of the panel here obscures the formatting.

Do pastebin or gist preserve formatting?

I have no idea what I am doing(I wish I did!), but is a possible approach to take the whole AST, and then just recreate the file, but formatted?

I may be stupid, I would love to know how a good implementation would work.

That is basically how ‘the AST approach’ works - just pretty prints the AST, interleaving in any comments, etc.

Cool, are comments stored in the AST?

With the parser rustc uses, no.