Multi-lingual support through Editions

I'm a huge Rust fan who has had some ideas spinning in my head, and I wanted to share them to see if they're feasible and interesting to folks here.

I'll start by mentioning where this idea came from. I was listening to Command Line Heroes Season 7 Episode 3, when the guest Gretchen McCulloch pointed out how many programming languages are written in English or use English-friendly syntax. It was something I had never considered, and I encourage folks to give it a listen.

A few weeks went by, and as part of a personal project, I stumbled upon ABNF and BNF, notations for describing the syntax of a given language. I started looking into compilers and realized you could generate a parser for any (valid) set of ABNF rules. I started to wonder if Rust, with its edition system which supports "interop" with previous editions, could be used to support multiple syntaxes of the language that are more natural to speakers of other (non-programming) languages (ex: Spanish, French, Mandarin, etc.).

With the background, I'll try and write out the general proposal.

  1. Rust will expand its edition support beyond just 2015, 2018, and 2021 to include 2021-sp, 2021-fr, 2021-etc as well.
  2. Each edition (moving forward) would map to a set of ABNF (or similar) parser rules that can be used to generate the parser for the given edition.
  3. The Rust compiler uses the parser corresponding to the crate's edition to generate the AST.
  4. The Rust compiler continues to do what it currently does with the AST after lexing/parsing.
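
To make steps 1 to 4 a bit more concrete, here is a minimal sketch of the keyword part only. Everything in it is hypothetical: the `Edition` and `Keyword` enums, the `lookup_keyword` helper, and the Spanish spellings (apart from "por", which comes from the cons list in this post) are illustrative assumptions, not how rustc models editions. The point it tries to show is that two surface spellings can map to the same language-neutral keyword, so later stages would not need to know which edition the crate used.

```rust
/// Language-neutral keywords that later compiler stages would see.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Keyword {
    Fn,
    If,
    Else,
    For,
}

/// Hypothetical edition identifiers, following the "2021-sp" naming above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Edition {
    E2021,
    E2021Sp,
}

/// Returns the keyword table for an edition: surface spelling -> neutral keyword.
fn keyword_table(edition: Edition) -> &'static [(&'static str, Keyword)] {
    match edition {
        Edition::E2021 => &[
            ("fn", Keyword::Fn),
            ("if", Keyword::If),
            ("else", Keyword::Else),
            ("for", Keyword::For),
        ],
        // Illustrative Spanish spellings only; not a vetted translation.
        Edition::E2021Sp => &[
            ("fn", Keyword::Fn),
            ("si", Keyword::If),
            ("sino", Keyword::Else),
            ("por", Keyword::For),
        ],
    }
}

/// Looks a word up in the table of the given edition.
fn lookup_keyword(edition: Edition, word: &str) -> Option<Keyword> {
    keyword_table(edition)
        .iter()
        .find(|(spelling, _)| *spelling == word)
        .map(|(_, keyword)| *keyword)
}

fn main() {
    // Both spellings resolve to the same neutral keyword.
    assert_eq!(lookup_keyword(Edition::E2021, "for"), Some(Keyword::For));
    assert_eq!(lookup_keyword(Edition::E2021Sp, "por"), Some(Keyword::For));
    // A spelling from one edition is just an identifier in another.
    assert_eq!(lookup_keyword(Edition::E2021, "por"), None);
}
```

A real implementation would need far more than a keyword table (word order, punctuation, and contextual keywords among other things), which is where the generated-parser part of the proposal comes in.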

Pros:

  • (Potentially) easier edition support in the future. Simply add new rules to, or remove rules from, a given parser.
  • (Potentially) easier feature support, as again, each feature would be defined in terms of a generic rule.
  • More inclusive language. Helps grow the community and language even further than where it is today.

Cons:

  • (Potentially) Maintenance. How easy will it be to manage and maintain all of these generic rules, especially when a lot of the rules will represent the same thing with different syntax (ex: "for" and "por" both defined for looping rules)?
  • Backwards compatibility. This feels like a rewrite of today's parser, at least for future editions.
  • Implementing rule-specific parsing features. One of the things the Rust compiler does EXTREMELY well is tell the programmer where things went wrong when compiling AND how they might fix it. I'm not sure how this is implemented, or how easy/feasible it would be to implement for these generic rule-based parsers.

Overall, I may be solving a problem that doesn't need to be solved and adding more trouble than it's worth, but I can't shake this idea until someone tells me it's not going to happen :slight_smile:. I've found the Rust community to be very welcoming and felt this may be a way to expand the language and welcome more people to enjoy the language and programming.


Translating a programming language itself will make all example code, tutorials, ... written for one version pretty much useless for another. It will also make code written in, say, the Japanese version completely unreadable for someone like me who doesn't understand Japanese. Currently it is still possible to somewhat make sense of code even if comments and identifiers are in a language you don't speak. Also, translating the syntax would only be a tiny part of translating a language. You would also have to translate the standard library, which has a much larger surface area. How would you deal with partial translations as new functions are added? And how should we determine which languages to translate to? Translating to all existing languages is infeasible, and I think covering only a subset will make people whose language Rust isn't translated to feel more excluded than they do now.

For context: English is not my native language. I still use English as the language for many programs, though, as translations of technical terms are often somewhat poor, and it allows me to use tutorials without having to guess how certain terms were translated to my native language.


Another thing: When I started programming I didn't have much of a clue what certain keywords actually meant in English. At that point each one was just some magic word that made the computer do what I wanted. Tutorials (and, to a lesser extent, type and function names) are much more important to translate when you want to teach a programming language to someone who doesn't understand English. I bet you can still understand the following with a bit of effort:

功能 斐波那契(n: 整数) -> 整数 {
     如果 n > 1 { 斐波那契(n-1) + 斐波那契(n-2) } 别的 { 1 }
}

(Created with the help of Google Translate. I am using Chinese as I presume you don't speak it and the non-Latin "alphabet" would make it even harder to read for people used to the Latin alphabet, not to offend anyone.)

Less so if this function were to use translated standard library functions.

Please don't do this. It has come up before, and it bears many (perhaps non-obvious) serious downsides, and it basically doesn't help beginners as much as you think it does. More on this here.


Thanks H2CO3, that was a very interesting read. I still think the root idea here would be interesting, but I think the concerns called out from that older thread are valid.

I'm curious if there would be value in updating the Rust parser to be based on ABNF/BNF rules even if there wasn't a "2021-FN" edition. Would this simplify future edition releases? Maybe it could even tie into Rust's feature/macro support (which I will acknowledge I know little about), so end users could define a ParserRules trait implementation to support these features/macros?

One of the ideas coming out of that thread was for people to create a fork/tools outside the core tools to PoC and explore this area. This might simplify that effort and also help document/define clear parsing rules for the language as a whole.

(speaking as a member of wg-grammar, but not for the group)

There is a working group whose intention is to produce a formal, executable grammar for Rust. Unfortunately, while making a grammar is simple enough, making a useful grammar is much more difficult, so most of the work has stalled.

The main reason for this is that the way that a human thinks about a language grammar, in a non-deterministic recursive descent kind of way, and the way a formal grammar is structured diverge significantly. The most obvious example is left recursion, which can be automatically eliminated, but there are many other cases where the grammar definition has to diverge from the syntax tree you want to get out, in order to communicate the unambiguous grammar to the computer.
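
As a small illustration of the left-recursion point, here is a sketch for a toy subtraction grammar (not Rust's grammar). A human would naturally write `expr := expr '-' num | num`, which a naive recursive-descent parser cannot run because `expr` calls itself first. The rewritten form `expr := num ('-' num)*` is parseable with a loop, but its shape no longer matches the left-leaning tree we want, so the parser has to rebuild that tree by hand. The `Expr`, `parse_expr`, and `eval` names are made up for this example.

```rust
/// The syntax tree we actually want: subtraction is left-associative.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Sub(Box<Expr>, Box<Expr>),
}

/// Parses whitespace-separated input such as "10 - 3 - 2" using the
/// rewritten rule `expr := num ('-' num)*`.
fn parse_expr(input: &str) -> Option<Expr> {
    let mut tokens = input.split_whitespace();
    let mut lhs = Expr::Num(tokens.next()?.parse::<i64>().ok()?);
    // The loop replaces the left-recursive call; folding into `lhs`
    // restores the left-leaning tree the original rule described.
    while let Some(op) = tokens.next() {
        if op != "-" {
            return None;
        }
        let rhs = Expr::Num(tokens.next()?.parse::<i64>().ok()?);
        lhs = Expr::Sub(Box::new(lhs), Box::new(rhs));
    }
    Some(lhs)
}

/// Evaluates the tree, so the associativity is observable.
fn eval(expr: &Expr) -> i64 {
    match expr {
        Expr::Num(n) => *n,
        Expr::Sub(a, b) => eval(a) - eval(b),
    }
}

fn main() {
    let tree = parse_expr("10 - 3 - 2").expect("valid input");
    // (10 - 3) - 2 = 5; a right-leaning tree would give 10 - (3 - 2) = 9.
    assert_eq!(eval(&tree), 5);
}
```

In a hand-written parser this fold is one line. In a grammar-driven generator, the same intent has to be expressed through the grammar formalism or a post-processing step, which is one place the two views drift apart.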

(It's for this reason that I make a formal distinction between a "parse tree", the tree generated by a parser, and a "syntax tree", the logical syntax tree used by a human modeling the grammar, and ideally by later compiler passes. With a hand-crafted (or nonpure generated) parser you can discard the unnecessary structure of a parse tree and generate the syntax tree directly.)

In addition, no existing automated parser generation scheme supports the level of error recovery and diagnosis that rustc's existing parser provides. Any replacement scheme for the existing parser has to at a minimum support the level of care that's gone into the existing parser, if not make it easier to re-add all of the assists, to avoid losing functionality.

A key part of this is that there are effectively two Rust grammars: one which describes well-formed Rust code which can be compiled, and a superset of that grammar which describes what the compiler can accept and then suggest how to massage into the stricter format. This latter grammar is much more vaguely defined, so it is much harder to capture its benefit in a formal system.
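
A toy sketch of the "two grammars" idea, under loud assumptions: this is not rustc's implementation, and the `Diagnostic` type and `parse_conjunction` function are invented for illustration. The strict grammar here is `operand (&& operand)*`. The lenient superset also accepts the word `and` in operator position, keeps parsing as if `&&` had been written, and records a suggestion.

```rust
/// A recoverable problem plus a hint for fixing it.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    message: String,
    suggestion: String,
}

/// Parses `operand (&& operand)*` from whitespace-separated words.
/// Strict grammar: only `&&` joins operands.
/// Lenient grammar: `and` is accepted too, with a diagnostic attached.
fn parse_conjunction(input: &str) -> Result<(Vec<String>, Vec<Diagnostic>), String> {
    let mut words = input.split_whitespace();
    let mut operands = Vec::new();
    let mut diagnostics = Vec::new();

    let first = words.next().ok_or("expected an operand")?;
    operands.push(first.to_string());

    while let Some(op) = words.next() {
        match op {
            "&&" => {}
            // Not well-formed, but the intent is clear: recover and go on.
            "and" => diagnostics.push(Diagnostic {
                message: "`and` is not an operator in this toy language".to_string(),
                suggestion: "use `&&` instead".to_string(),
            }),
            // No plausible recovery: give up with a hard error.
            other => return Err(format!("unexpected `{}`", other)),
        }
        let operand = words.next().ok_or("expected an operand after the operator")?;
        operands.push(operand.to_string());
    }
    Ok((operands, diagnostics))
}

fn main() {
    let (operands, diagnostics) = parse_conjunction("a and b && c").unwrap();
    assert_eq!(operands, vec!["a", "b", "c"]);
    assert_eq!(diagnostics.len(), 1);
    assert_eq!(diagnostics[0].suggestion, "use `&&` instead");
}
```

Each recovery like the `"and"` arm is a small, hand-picked extension of the strict grammar, which is part of why the lenient grammar is hard to pin down formally.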

We effectively have three distinct production Rust parsers in use today: rustc's, syn's, and rust-analyzer's. These are enough that I feel we understand the problem domain of parsing Rust fairly well, and while having a formal grammar for what should be accepted as well-formed Rust code would be nice, it's not a priority, and having a formal definition is not useful unless that definition is useful (as a human reference, as an executable test suite, or otherwise).

rust-analyzer uses a semiformal grammar to define the syntax tree (not the parser) they accept and process, which is a strict superset of well-formed Rust code (so they can process in-progress code without losing functionality). You might also be interested in efforts to merge parts of r-a and rustc, such as the syntax tree itself.


Personally, I think it's possible to make a language grammar that is comfortable to use (with a C-adjacent bracket syntax), supports great error recovery, and uses a formally generated parser all at once, especially with advances in algorithms such as GLL and GLR that allow parsing less restricted grammars. However, the grammar would have to be designed with the implementation in mind, to play to the algorithm's strengths and avoid its weaknesses, to get the best result. Rust's grammar has grown organically with a handwritten parser for a good while; as such I think the best Rust parser will always be a fully handwritten one, at least until another breakthrough in generated parsers.

(Personally, I think GLL has a better shot at producing good human error messages, as humans tend to think top-down, while LR family parsers work bottom-up. But I've worked with both GLL and GLR only super superficially, so I don't know.)


Parsing is not really the main barrier to this sort of thing, the documentation is. If you had, e.g., a high quality copy of the rust std library written in another language except for the code, you'd be well on your way to having achieved most of what you're asking for. The code isn't the problem. serde, rand, syn, libc, bitflags, log, tokio... If these are all documented only in English, you're not making an English-comparable experience of the language. Also cargo, clippy, rustfmt, IDEs...

And if you've got high quality translations of all of that, I'd think the bulk of the real work is done, and it should be much easier to motivate the language team to do something with rustc. Otherwise, you're just changing Rust's syntax to cater to an empty void of frustration.


@CAD97 Do you know where I could learn more about this formal grammar working group? It does seem interesting to me. I would agree, as you stated, that this tool would need to be as good as or better than the hinting of the Rust compiler today to be a practical replacement. I could see a world where Rust core uses a handwritten parser for this great hinting, but still defines a strict grammar that another tool (maybe babelcrab, to build off the babelfish idea from JS in the previous thread) could leverage to support more native languages. I'm not sure it's fair to expect the core team to maintain these rule definitions when they're not directly using them, but again, maybe they can find other uses for them without implementing this translation tool.

@skysch I don't disagree with your point that translating documentation would largely be more helpful than translating code, but ironically, I think that's a much harder problem to solve. The number of rules defining human languages will likely be orders of magnitude larger than the rules of any programming language. I'll also add that I often look at source code instead of documentation to understand how a tool/library works, as I find it can be similarly effective if the code is written well, and unlike documentation it does not get outdated or lie :slight_smile: (This is also ignoring the numerous libraries/tools that exist with little to no documentation, where the only way to learn them is to read the source code.)

All that said, I have removed the [Pre-RFC] from the title of the thread as there are building blocks required before this would even be feasible to implement.

One tangential thought on this thread, since I tend to do a lot of thinking and little useful work :slight_smile: One area I've been looking into lately is wasm. One thing that interests me greatly about wasm is the idea of simplifying interoperability between languages. (The improved performance in the browser is great, but I'm much more interested in writing statically typed code for a webpage in a language like Rust, or even supporting interop outside the browser with wasmtime or similar.) My understanding is there are a number of people with similar feelings about wasm, based on this video from the wasm summit 2020. However, wouldn't a lot of the fears from the thread @H2CO3 posted hold up in a multi-(programming)-lingual browser? If a library I want to use is written in, say, C, I need to learn C if I really want to understand that library. Maybe this is fine because these libraries will still be written in English-based programming languages and I will be able to read their English documentation, but as mentioned above, I tend to follow the "trust, but verify" approach to code documentation. I'm probably missing something, but I don't understand why folks would be excited about interop between programming languages, but be much less interested in interop between human syntaxes.

Maybe I'm dreaming for too much, but I can imagine a world where

  1. every programming language can compile to wasm
  2. wasm can be reverse compiled to Rust (I recall hearing that with wasm being an open spec, reverse compiling is possible, but this may be another oversimplification. I like to do that, if you couldn't tell already :slight_smile: )
  3. every Rust program can be translated to a "native" version of Rust in human languages

That's a lot of overly simplified steps, but if those steps exist, everyone could theoretically read any code in their native human language. Translated documentation seems better, but this ironically seems more systematic (and therefore easier) to me :slight_smile: