I’d like to understand what the hard cases for the approach are, and how we can deal with them down the road. Perhaps you understand this already and can explain, or perhaps it requires more investigation.
I think that the hard cases are code transformations. For example, in match arms a comma is required after an arm whose body is a single statement not wrapped in curly braces, unless it is in the last position; otherwise the comma is optional. It’s really hard to erase these optional commas, because the decision depends on a lot of context and can’t be made by looking only at the tokens immediately before and after the comma.
To illustrate that, the decision rule “no comma after a closing curly brace when I’m in a match block” breaks down on cases like this:
match foo {
    Err(e) =>
        match e {
            ...
        },
    Ok(_) => {}
}
Here the comma is needed because the arm body is a single expression, and that expression happens to end with a curly-braced block. Maybe the style guide should say that a match arm whose body is a single statement or expression but spans several lines should itself be wrapped in curly braces; however, performing that wrapping automagically would be equally hard for a lexeme-based pretty printer.
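For comparison, the braced form such a style rule would produce might look like this (same example as above; once the arm body is a block, the trailing comma can be dropped):

match foo {
    Err(e) => {
        match e {
            ...
        }
    }
    Ok(_) => {}
}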
I suspect all this would be easy when working with the AST.
Other hard cases are ambiguous tokens like *, <, >, and -. These must be formatted differently depending on the context they appear in. With the lexeme-based approach, one must rely on heuristics to decide these cases.
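To make the ambiguity concrete, here are the same tokens in different syntactic roles (plain Rust; the variable names are only for illustration):

fn examples(a: i32, b: i32, c: i32, x: i32, y: i32) {
    let _product = b * c;            // * as multiplication: spaces around it
    let _ptr: *const i32 = &a;       // * in a raw-pointer type: no space after it
    let _vec: Vec<i32> = Vec::new(); // < and > as generic brackets: no spaces
    let _less = x < y;               // < as comparison: spaces around it
    let _diff = a - b;               // - as subtraction: spaces around it
    let _neg = -a;                   // - as unary negation: no space after it
}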
Generally I would say: the more sophisticated the transformation we want to perform, and the more context it needs, the less suitable the lexeme-based approach becomes.
On the other hand (and that is the part I found beneficial), the simplicity is intriguing. In the end, it’s a handful of rules about where to inject whitespace and line breaks, and that’s it. Also, it is not that hard to work with pieces that have no syntactic meaning, like comments.
And it will produce something okay-ish even on snippets and incorrect code.
Do you have an idea at what stage you’ll implement your own fuzzy parser?
I do track “context” with a stack. For example, when encountering an opening brace I push a new context onto the stack and pop it when the closing brace is found. This is used for correct indentation and for applying the correct whitespace-injection rules. It can be seen as a kind of rudimentary bottom-up parsing, even though no real AST is built.
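A minimal sketch of the idea, just to make it concrete (the Context enum and format function here are made up for illustration, not the actual code, and the real whitespace-injection rules are elided):

// Hypothetical sketch: a stack of contexts drives indentation.
// A real version would distinguish more context kinds (match
// blocks, parentheses, ...) to pick the right injection rules.
enum Context {
    Block, // pushed on `{`, popped on `}`
}

fn format(tokens: &[&str]) -> String {
    let mut stack: Vec<Context> = Vec::new();
    let mut out = String::new();
    for tok in tokens {
        if *tok == "}" {
            stack.pop(); // closing brace ends its context
        }
        // Indentation is derived from the current stack depth.
        out.push_str(&"    ".repeat(stack.len()));
        out.push_str(tok);
        out.push('\n');
        if *tok == "{" {
            stack.push(Context::Block); // opening brace starts a context
        }
    }
    out
}

fn main() {
    print!("{}", format(&["match foo", "{", "Ok(_) => {}", "}"]));
}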