Pre-RFC: fully structured rustc diagnostics towards l10n


#1

Note: this is about i18n support for the rustc JSON output, not i18n support in the language itself nor the terminal UI of rustc. i18n should always be done at library-level as it’s ultimately a domain-specific concern.

Summary

Make rustc’s diagnostics fully structured and thus i18n-aware, so that meaningful i18n and l10n work can start across the ecosystem.

Motivation

The obvious: outreach, a much smoother learning curve for non-native speakers of English, and so on.

Present situation

We decided long ago not to handle i18n and l10n at the language level (maybe since the i18n format got removed from format!?), and rustc currently supports neither. A machine-readable output is present in rustc, namely the --error-format=json flag, but it’s not directly usable as an aid to l10n in its current form, even though a machine-readable output format is usually a strong indicator of i18n readiness. Namely:

  • many errors and (all?) warnings don’t have a unique diagnostic code assigned, and
  • the parameters are rendered into the template message strings instead of being separate,

so that tool writers are bound to have a difficult time processing the messages if they want to tinker with them in any non-trivial way, especially i18n and l10n. This is unfortunate, and it hurts adoption from those developers who may have difficulties reading English beyond keywords used in programs. (Hint: the percentage is not negligible in East Asia for example.) Also, as almost any structure is better than plain strings, tool writers across the ecosystem would all benefit as well from the improved diagnostics. The command-line UI is not touched at all, so the impact of the change should be minimal.

(To solve the international outreach problem an overall design is required, but let’s leave it for another RFC. I believe the core team is in a better position to put forward such an RFC than me, an individual volunteer.)

Design

First, diagnostic messages generated by calls like span_err() are unstructured and need to be moved entirely to structured counterparts, with diagnostic codes assigned. A policy of allowing only structured diagnostic messages should be adopted; preferably the unstructured diagnostic calls are removed outright, so that unstructured messages become impossible to construct.

Then, extend the structured diagnostic message macros to preserve the parameter mapping.

Finally, extend the JSON output with the addition of a “template parameters” mapping field. For compatibility with already existing tools, the current pre-rendered message field is not dropped.
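To make the last step concrete, here is a minimal sketch of what one such record could carry and how a tool might re-render it. The struct, its field names (`template`, `params`) and the `{name}` placeholder syntax are all assumptions made for illustration; none of them is part of rustc's actual JSON schema.

```rust
use std::collections::BTreeMap;

/// Hypothetical structured diagnostic. The field names are illustrative
/// only and are NOT part of rustc's actual JSON output.
#[derive(Debug, Clone)]
pub struct StructuredDiagnostic {
    /// Unique diagnostic code, e.g. "E0425".
    pub code: String,
    /// Message template with `{name}` placeholders (assumed syntax).
    pub template: String,
    /// The proposed "template parameters" mapping.
    pub params: BTreeMap<String, String>,
    /// Pre-rendered English message, kept for existing tools.
    pub message: String,
}

/// Substitute `{name}` placeholders in `template` with values from `params`.
/// Unknown placeholders are left untouched so missing data stays visible.
pub fn render(template: &str, params: &BTreeMap<String, String>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        match after.find('}') {
            Some(close) => {
                let name = &after[..close];
                match params.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        out.push('{');
                        out.push_str(name);
                        out.push('}');
                    }
                }
                rest = &after[close + 1..];
            }
            None => {
                // Unbalanced brace: emit the remainder verbatim.
                out.push_str(&rest[open..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With the parameters kept separate, a localizing tool only has to swap `template` for a translated one keyed by `code` and call `render` again; tools that don't care keep reading `message`.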

Drawbacks

More work is needed when creating new diagnostics.

Alternatives

Do nothing

Just keep everything unchanged. People who find compiler messages hard to understand just have to learn themselves some English instead of retreating to their comfort zone.

Parse the messages back into template and parameters

A great number of message templates have backticks around code snippet parameters, and backticks are not used in Rust. So it’s trivial to parse the parameters out of the rendered string.

However, not all messages are formatted this way, and special cases are undesirable.
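A rough sketch of this alternative, to show both why it is tempting and where it falls short. The function name and the positional `{0}`, `{1}` placeholder convention are assumptions of this example, and the second message in the usage note is made up for illustration rather than quoted from rustc.

```rust
/// Split a rendered message into a template and its backtick-quoted
/// parameters, replacing each parameter with a positional `{n}` placeholder.
/// Messages with an odd number of backticks are returned unchanged.
pub fn split_backticks(rendered: &str) -> (String, Vec<String>) {
    // Splitting on '`' alternates between literal text (even indices)
    // and quoted parameters (odd indices).
    let pieces: Vec<&str> = rendered.split('`').collect();
    // An even number of pieces means an odd number of backticks.
    if pieces.len() % 2 == 0 {
        return (rendered.to_string(), Vec::new());
    }
    let mut template = String::new();
    let mut params = Vec::new();
    for (i, piece) in pieces.iter().enumerate() {
        if i % 2 == 0 {
            template.push_str(piece);
        } else {
            template.push_str(&format!("`{{{}}}`", params.len()));
            params.push(piece.to_string());
        }
    }
    (template, params)
}
```

For "cannot find value `foo` in this scope" this recovers the parameter `foo`. For an illustrative message such as "expected 2 parameters, found 3" it recovers nothing, because the numbers are not quoted; each such shape would need its own special case.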

What about dropping the pre-rendered messages from output?

This is certainly doable, and will make the output very concise and elegant. Let’s transform everything into codes! However, it has a potentially serious side effect: the compiler is no longer the single source of truth for the human-oriented rendered messages, but only for the individual combinations of diagnostic codes and parameters.

To save tools from rolling their own message renderers and keep the messages consistent across tools, a separate crate would be created to provide the reference message rendering, preferably officially maintained. Ideally, as the diagnostics are assembled during rustc compilation, such a crate could be auto-generated along with rustc itself. But it’s a nontrivial amount of addition to rustbuild, so we’d rather not go this way for now.


#2

As a fun afternoon project I threw together a cargo plugin that returns sarcastic error messages based off a reddit comment. Pretty quickly I realized that it was basically an i18n problem, just translating the error messages into a different speech pattern rather than a different language.

As well as the lack of structured parameters in the json output (which I ignored by just adding suffixes onto a couple of messages), the biggest thing I ended up having to implement to get even a single message output was the rendering around the messages, bolding/colourizing the correct text and formatting all the spans together.

It feels like there are three major stages to outputting each message: gathering all the context (error code, parameters, spans, etc.), producing translated strings (the main error message, labels for the spans, etc.) and outputting the result somewhere (console, IDE, etc.). At the moment you have the option of rustc performing the first and second stages (with json output), or all three (by default). But if you have rustc do just the first two, then perform some string matching to try and translate the pre-rendered strings, there’s no way (at least that I could find in a few minutes searching) to feed the result back into rustc to do the third stage for you.
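The three stages can be sketched as separate pieces, with the second one behind a trait so it can be swapped. Everything here (`Context`, `Rendered`, `Translator`, `emit`, `report`, and the output layout) is made up for illustration and does not mirror rustc's internals.

```rust
/// Stage 1 output: what is known about a diagnostic before any text is chosen.
pub struct Context {
    pub code: String,
    pub params: Vec<(String, String)>,
    pub line: usize,
    pub column: usize,
}

/// Stage 2 output: the human-readable strings.
pub struct Rendered {
    pub message: String,
}

/// Stage 2 as a replaceable component.
pub trait Translator {
    fn translate(&self, ctx: &Context) -> Rendered;
}

/// Look up a parameter by name, falling back to "?" when it is missing.
fn param<'a>(ctx: &'a Context, key: &str) -> &'a str {
    ctx.params
        .iter()
        .find(|(k, _)| k.as_str() == key)
        .map(|(_, v)| v.as_str())
        .unwrap_or("?")
}

/// Plain English wording; this sketch only knows a single code.
pub struct English;

impl Translator for English {
    fn translate(&self, ctx: &Context) -> Rendered {
        let message = match ctx.code.as_str() {
            "E0425" => format!("cannot find value `{}` in this scope", param(ctx, "name")),
            other => format!("unknown diagnostic {}", other),
        };
        Rendered { message }
    }
}

/// A different "speech pattern"; it ignores the code entirely.
pub struct Sarcastic;

impl Translator for Sarcastic {
    fn translate(&self, ctx: &Context) -> Rendered {
        Rendered {
            message: format!("`{}`? Never heard of it.", param(ctx, "name")),
        }
    }
}

/// Stage 3: layout, shared by every translator.
pub fn emit(ctx: &Context, text: &Rendered) -> String {
    format!(
        "error[{}]: {}\n  --> line {}, column {}",
        ctx.code, text.message, ctx.line, ctx.column
    )
}

/// Run stage 2 and stage 3 over an already gathered context.
pub fn report(ctx: &Context, translator: &dyn Translator) -> String {
    emit(ctx, &translator.translate(ctx))
}
```

Calling `report(&ctx, &English)` and `report(&ctx, &Sarcastic)` changes only the wording; the layout code in `emit` is written once, which is the part external tools currently have to duplicate.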

It would be great if there were an easy way to drop in a replacement for just the second stage. If rustc's json output were enhanced with the rest of the context needed to render the messages, and rustc provided a way to feed that json back in together with the rendered messages to perform the console formatting and layout, then it would be possible to work on translations outside rustc while still getting all the nice error layout that rustc has.

EDIT: Just saw rustc-l10n and noticed you have the exact same issue of having to duplicate rustc's rendering :)


#3

See also this issue about making a general purpose l10n library: https://github.com/rust-lang/rust/issues/14495


#4

I now have some more thoughts on the proper way of integration, now that sufficient time has passed.

Integration with rustc

Basically what I wanted to do originally was rustc -> json -> localizer -> output; however, there’s the inevitable problem of duplicated rendering logic, as @Nemo157 (and I) noticed. So there has to be a better way to integrate this.

We could specify an interface for rustc to dynamically load a localizer, sort of like a compiler plugin, except not involved in actual compilation. Then tell the struct_span_xxx! macros to consult the localizer plugin before actually outputting anything. And voila, a localized rustc with localization rules out-of-tree while staying synced with the compiler!
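A minimal sketch of what such an interface might look like from the compiler's side, leaving the dynamic loading itself out. The `Localizer` trait, `TableLocalizer`, `diagnostic_message` and the `{name}` placeholder syntax are all hypothetical; nothing like this exists in rustc as far as this post is concerned.

```rust
use std::collections::HashMap;

/// Hypothetical interface a localizer plugin would implement.
pub trait Localizer {
    /// Return a localized message, or `None` to fall back to the
    /// built-in English message.
    fn localize(&self, code: &str, params: &[(&str, &str)]) -> Option<String>;
}

/// A table-driven localizer: one template per diagnostic code,
/// with `{name}` placeholders (assumed syntax).
pub struct TableLocalizer {
    pub templates: HashMap<String, String>,
}

impl Localizer for TableLocalizer {
    fn localize(&self, code: &str, params: &[(&str, &str)]) -> Option<String> {
        let template = self.templates.get(code)?;
        let mut text = template.clone();
        for &(name, value) in params {
            // Replace every occurrence of `{name}` with its value.
            text = text.replace(&format!("{{{}}}", name), value);
        }
        Some(text)
    }
}

/// What a diagnostic macro could do before emitting anything:
/// consult the localizer if one is loaded, otherwise keep the English text.
pub fn diagnostic_message(
    localizer: Option<&dyn Localizer>,
    code: &str,
    params: &[(&str, &str)],
    english: &str,
) -> String {
    localizer
        .and_then(|l| l.localize(code, params))
        .unwrap_or_else(|| english.to_string())
}
```

Returning `Option` means a partially translated table degrades to English per message rather than failing, so a localizer that lags behind the compiler still works for the codes it knows.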

Relation with RLS

As such a localizer will be yet another out-of-tree component tightly coupled with compiler internals, I wondered whether this should be merged into the RLS. My hunch is that we shouldn’t do this, at least not until RLS is absorbed into rustc itself; also, the degree of integration required is greater than a client/server architecture could provide. So IMO this localizer effort should go under a separate project.