Translating the compiler

See also: Translating the stdlib docs

Over the past few weeks I’ve helped integrate Fluent into the codebase for http://www.rust-lang.org (building off of Rui Zhao’s work), and it’s all working now. All the text in the website has been extracted out into Fluent FTL files and various translation teams are using Pontoon to submit translations. The Turkish translation has already been completed, and Korean and French seem to be coming soon!

I’d like to start seriously talking about localizing the compiler (i.e., any output from the compiler), as the next-easiest step.

This has been discussed before. At the time it was rejected due to a lack of bandwidth, but I also don’t think internationalization frameworks in Rust were quite ready. I’ve been helping out the Fluent folks in the Rust design process, and the framework has progressed far enough to cover 99% of what we’ll need.

The framework itself isn’t hard to integrate; you can see the code for the website here. The basic idea is that you load FTL files for any language you’re interested in, build a “bundle” for each one, and then have your formatter use the bundle to pick the actual strings used.
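
As a rough illustration of that flow (load FTL text, build one bundle per language, let the formatter look strings up through the bundle), here is a std-only sketch. It is a toy stand-in and not Fluent's actual API: `Bundle`, `parse_simple_ftl`, and `format_message` are hypothetical names, and the parser only understands single-line `id = value` messages.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Fluent bundle: message id -> pattern text.
/// (Hypothetical type; the real crate's API differs.)
pub struct Bundle {
    pub locale: String,
    messages: HashMap<String, String>,
}

impl Bundle {
    /// Builds a bundle from FTL-like source, handling only `id = value` lines.
    pub fn parse_simple_ftl(locale: &str, source: &str) -> Bundle {
        let mut messages = HashMap::new();
        for line in source.lines() {
            let line = line.trim();
            // Skip blank lines and `#` comments.
            if line.is_empty() || line.starts_with('#') {
                continue;
            }
            if let Some((id, value)) = line.split_once('=') {
                messages.insert(id.trim().to_string(), value.trim().to_string());
            }
        }
        Bundle { locale: locale.to_string(), messages }
    }

    /// Returns the pattern for `id`, if this language has one.
    pub fn get(&self, id: &str) -> Option<&str> {
        self.messages.get(id).map(String::as_str)
    }
}

/// One bundle per language; the formatter picks the string from the requested one.
pub fn format_message<'a>(
    bundles: &'a HashMap<String, Bundle>,
    locale: &str,
    id: &str,
) -> Option<&'a str> {
    bundles.get(locale)?.get(id)
}
```

A caller would build the map once (one entry per locale) and call `format_message(&bundles, "en-US", "foo-borrow")` wherever a user-facing string is needed.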

I’ll post what the syntax could look like in a followup message.

Open questions I’d like people to discuss:

  • Overall project organization:
    • Do we have the bandwidth to do this at all? I can help integrate Fluent itself, and moving text into Fluent files is a very parallelizable task. It’s a daunting one for the compiler, but it’s still parallelizable.
    • Are we confident that we can come up with a policy good enough to lead to quality translations? Technical translations are hard. The website has a policy which is decent – each translation must be reviewed by someone else. We probably want something similar. It would also help to annotate the Fluent files with as much context as possible (linking to the error index where possible) and to require translators to understand what the error is about first.
    • Should we simultaneously start looking at localizing other tools?
  • UX issues:
    • How is language selection done? Some Rust-specific environment variable? $LC_ALL? A command line flag that gets set by rustup (which itself has a config value in .rustup)?
  • Implementation issues:
    • Presumably we should lazy load languages as they become needed. The website doesn’t do this (it loads everything up front so there is less of a locking overhead), but presumably the compiler will need this for performance.
      • It’s possible to have multiple lazy-loaded bundles per language, e.g. there can be one bundle that’s just used for typechecker errors, etc. This just means we won’t hit the cost of loading every single string the compiler uses the first time the compiler needs a string. I don’t know what the performance hit of doing this is, though, so this may be a premature optimization.
    • Should the FTL files be compiled into the binary, or be in a folder as part of the dist bundle?
    • Should we just move error index stuff into a Fluent file and call it a day? There may be a different way to deal with the error index.
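
To make the lazy-loading question above concrete, here is a minimal single-threaded sketch of a cache that loads a language's messages the first time one is requested. Everything here is an assumption for illustration: `LazyBundles`, the `loader` callback, and `load_count` are hypothetical names, and a real compiler would need a thread-safe variant (the locking overhead mentioned above).

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

pub type Messages = HashMap<String, String>;

/// Loads a locale's messages on first use and caches the result afterwards.
/// A failed load (`None`) is cached too, so it is not retried on every lookup.
pub struct LazyBundles<F: Fn(&str) -> Option<Messages>> {
    loader: F,
    cache: RefCell<HashMap<String, Option<Messages>>>,
    loads: Cell<usize>,
}

impl<F: Fn(&str) -> Option<Messages>> LazyBundles<F> {
    pub fn new(loader: F) -> Self {
        LazyBundles {
            loader,
            cache: RefCell::new(HashMap::new()),
            loads: Cell::new(0),
        }
    }

    /// Looks up `id` for `locale`, invoking the loader only on the first request.
    pub fn get(&self, locale: &str, id: &str) -> Option<String> {
        let mut cache = self.cache.borrow_mut();
        if !cache.contains_key(locale) {
            self.loads.set(self.loads.get() + 1);
            let loaded = (self.loader)(locale);
            cache.insert(locale.to_string(), loaded);
        }
        let found = cache.get(locale)?.as_ref()?.get(id).cloned();
        found
    }

    /// Number of times the loader ran; handy for measuring the cost discussed above.
    pub fn load_count(&self) -> usize {
        self.loads.get()
    }
}
```

The same shape works for finer-grained bundles (for example one per compiler phase) by keying the cache on a `(locale, phase)` pair instead of the locale alone.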

I believe that to a large extent this can be run without interfering with compiler work at all, though I’m not sure if I personally have the bandwidth to be the only person running it.

Edit:

See my comment below: the compiler team already hopes to move diagnostics over to “diagnostics structs”. If we use a custom derive or similar macro for it, it becomes really easy to incrementally migrate to Fluent, in a largely-automated way. It shouldn’t be much work at all to convert the system to using translation strings; however, moving everything to diagnostics structs will probably still require help :slight_smile:
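
To give a feel for what a diagnostics struct could look like, here is a hand-written sketch of the kind of code a derive might generate. The `Diagnostic` trait, the `MoveOutOfBorrow` struct, and `render` are all hypothetical and are not rustc's actual API; the sketch assumes Fluent-style `{ $name }` placeables in the pattern.

```rust
/// Hypothetical trait a derive macro could implement for each diagnostic struct.
pub trait Diagnostic {
    /// Language-neutral Fluent message id.
    fn message_id(&self) -> &'static str;
    /// Named arguments handed to the translation layer.
    fn args(&self) -> Vec<(&'static str, String)>;
}

/// Example diagnostic struct; field names double as Fluent variable names.
pub struct MoveOutOfBorrow {
    pub ty: String,
}

// What a derive might expand to (written by hand here).
impl Diagnostic for MoveOutOfBorrow {
    fn message_id(&self) -> &'static str {
        "foo-borrow"
    }

    fn args(&self) -> Vec<(&'static str, String)> {
        vec![("ty", self.ty.clone())]
    }
}

/// Renders a diagnostic against a pattern containing `{ $name }` placeables.
pub fn render(diag: &dyn Diagnostic, pattern: &str) -> String {
    let mut out = pattern.to_string();
    for (name, value) in diag.args() {
        // Builds the literal text `{ $name }` and swaps in the argument value.
        out = out.replace(&format!("{{ ${} }}", name), &value);
    }
    out
}
```

Because the struct carries only data and an id, the English text can move into an FTL file without touching the code that constructs the diagnostic.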

cc @ekuber @sebasmagri @oli-obk @skade @nikomatsakis @GuillaumeGomez


We’d probably be able to abstract out almost all of the Fluent glue code into a separate reusable crate.

The syntax could look something like the following:

// librustc_foo/code.rs
sess.struct_span_err(span, text!(foo-borrow))

// locales/en-US/foo.ftl
foo-borrow = Cannot move out of a borrow

with the ability to pass parameters:

// librustc_foo/code.rs
sess.struct_span_err(span, text!(foo-borrow, ty=ty.to_string()))

// locales/en-US/foo.ftl
foo-borrow = Cannot move out of a borrow of type { $ty }

including choosing different variants of a message, whether it be a number (with pluralization support!):

// librustc_foo/code.rs
sess.struct_span_err(span, text!(foo-abort, number=n))

// locales/en-US/foo.ftl
foo-abort =
    { $number ->
        [one] Aborting due to one error
       *[other] Aborting due to { $number } errors
    }

or something else:

// librustc_foo/code.rs
let kind = match item.kind {
    ItemKind::Struct => "struct",
    // ...
};
sess.struct_span_err(span, text!(foo-missing-doc, kind=kind))

// locales/en-US/foo.ftl
foo-missing-doc =
    { $kind ->
        [struct] Missing documentation for a struct
       *[item] Missing documentation for an item
        // ...
    }
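
For the curious, a `text!` macro along these lines can be written with `macro_rules!`. The following is only a sketch under assumptions: the `Text` struct and `select_variant` helper are hypothetical, the macro merely records the id and arguments, and the actual formatting would be left to Fluent.

```rust
/// Hypothetical message reference: a Fluent id plus named string arguments.
#[derive(Debug, PartialEq)]
pub struct Text {
    pub id: &'static str,
    pub args: Vec<(&'static str, String)>,
}

/// Sketch of `text!`: accepts hyphenated ids such as `foo-borrow`
/// followed by optional `key = value` pairs.
macro_rules! text {
    ($first:ident $(- $rest:ident)* $(, $key:ident = $value:expr)*) => {
        Text {
            // Re-joins the identifier pieces with hyphens at compile time.
            id: concat!(stringify!($first) $(, "-", stringify!($rest))*),
            args: vec![$((stringify!($key), $value.to_string())),*],
        }
    };
}

/// Picks the variant whose key matches, else the default
/// (the arm marked with `*` in an FTL select expression).
pub fn select_variant<'a>(
    variants: &[(&str, &'a str)],
    default: &'a str,
    key: &str,
) -> &'a str {
    variants
        .iter()
        .find(|(k, _)| *k == key)
        .map(|(_, v)| *v)
        .unwrap_or(default)
}
```

With this, `text!(foo-abort, number = n)` evaluates to a `Text` whose `id` is `"foo-abort"` and whose only argument is `number`, which a formatter could then resolve against the loaded bundle.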

One side point: many of the technical terms used by the Rust language are quite unique, so different translators must make their own choices when picking the proper words in different languages.

I wonder if it makes sense to create a glossary translation project, so that everybody working in a specific language can agree on how to name things.


Since this is doable in a very incremental way, I don’t see any problems with bandwidth. We can have a tracking issue with many checkboxes for small parts of the compiler and then invite even new contributors to participate.

At least en-US I’d bake into the compiler binary, not sure if we should be adding more directly. There’s not much difference between having these in the compiler and just shipping them along with the compiler other than that rustup could ship them as a component so you only pay the cost when you actually need it.

Wrt the pre-loading of the FTL data, it does feel like a premature optimization. We should probably just start with preloading at start and see if it has an impact. It should be easy enough to just test this out with a random big FTL file.


Yeah, I’m assuming en-US would be baked in. We could probably go further and bake the string in directly whenever the Fluent id is a simple string, but that’s probably unnecessary.

It’s been a while since we last proposed this, I’m super happy it’s coming back!

For the language selection, I assumed it could be done by detecting the OS language (I *think* it’s possible), alongside an option to override it.

For the implementation part, I think it should be compiled into the binary to avoid potential path issues.

A thing you didn’t talk about is: should we provide all languages at once for everyone or will the users have to pick which language(s) they want? Also, at which point do we consider a language bundle to be ready to be used?

I'm -0.5 on this since it creates an additional language barrier when communicating errors, especially if turned on by default. (1) It hinders support across countries; instead of everyone knowing one common language, you'll need to know N natural languages to understand the terminology used in other people's compiler output. (2) Googling the localized error message may not give you useful information. (3) Technical issues like console support, IDE support, etc.

Also I think it's still premature to put serious effort into compiler message localization; the standard library docs are a more important next step (if we have to serialize the efforts).


Anyway, if we proceed, please

  • Ensure every error and warning has a language-neutral error code or identifier. Currently parser errors are infamously lacking error codes. Furthermore some generic codes like E0308 (mismatched types) require reading the notes to know the details.

  • Consider how Clippy and Rustfmt fit into the system when they're ready. This should be easy if all tools use the same diagnostic library.

  • Consider if we want to support translation for dynamic, library-provided error messages, i.e.

    • compile_error!() (probably no),
    • #[deprecated] (probably no),
    • #[rustc_on_unimplemented] (probably yes)
    • #[unstable] and #[rustc_deprecated] (probably yes),
    • errors from syntax extensions like format_args!() (probably yes),
    • errors from proc macros (:thinking:)

    — and if supported, how the keys are looked up and where to find the FTL.

  • (There might be some issues with RTL languages, but I can't think of any real issues, as the entire output structure is still LTR.)

Big -1 on $LC_ALL alone since it affects the entire system not just rustc. rustup could seed the initial choice from $LC_ALL but rustc should not rely primarily on $LC_ALL. There should be an easy way to fix the compiler output to English regardless of my OS interface language (and vice-versa).

There should be a way to easily and permanently change the display language with a single command, so I prefer a config value in a setting file. Since the rustup proxy has never added a command line flag or env var before (RUST_RECURSION_COUNT doesn't count), it's more appropriate to let Cargo do it.

  1. When installing rustup, allow changing the display language (either en_US or $LC_ALL by default).
  2. Write that setting into ~/.cargo/config (or ~/.cargo/language, but not ~/.rustup/settings.toml).
  3. Add a command like rustup language set zh_TW to change the language (which overrides the value in ~/.cargo/config)
    • Additionally provide rustup language list to list the available languages.
  4. When cargo invokes rustc/rustdoc/(clippy/rustfmt?), pass that additional env var or command line to the tool.
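
One possible reading of these steps, from the tool's side, is a simple precedence chain: an explicit flag or env var passed by Cargo wins, then the stored config value, then the OS locale, then English. The sketch below assumes that order, which is not a settled design; `resolve_language` and `normalize_locale` are hypothetical names, and the caller is assumed to have already read the flag, the config file, and `$LC_ALL`.

```rust
/// Resolves the display language. Precedence (an assumption for illustration):
/// explicit override, then config value, then OS locale, then `en-US`.
pub fn resolve_language(
    tool_override: Option<&str>,
    config_value: Option<&str>,
    os_locale: Option<&str>,
) -> String {
    tool_override
        .or(config_value)
        .or(os_locale)
        .map(normalize_locale)
        // `C`/`POSIX` normalize to an empty tag, which means "no preference".
        .filter(|tag| !tag.is_empty())
        .unwrap_or_else(|| "en-US".to_string())
}

/// Turns POSIX-style values such as `zh_TW.UTF-8` into `zh-TW`.
pub fn normalize_locale(raw: &str) -> String {
    // Drop the encoding (`.UTF-8`) and modifier (`@euro`) suffixes.
    let base = raw
        .split(|c: char| c == '.' || c == '@')
        .next()
        .unwrap_or("");
    if base == "C" || base == "POSIX" {
        return String::new();
    }
    base.replace('_', "-")
}
```

Keeping `$LC_ALL` last in the chain reflects the concern above: the OS locale only seeds the choice and never overrides an explicit setting.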

Given that distros sometimes split localization into separate packages (e.g. https://packages.debian.org/buster/gcc-8-locales), I don't think the FTL files should be embedded into rustc at all.

This could be merged into annotate-snippets-rs? :wink:


I think this is adequately addressed by making this require a rustup language set fr or whatever

It's worth noting: a lot of tools have support for this and people use the localized versions.

Rustdoc doesn't have enough people working on it to start this IMO. We could totally start doing this in parallel: You need to basically give each doc comment in std an id, autogenerate an FTL file based on that id using some tool, teach rustdoc to generate docs for other languages when requested, and eventually make infra automatically run this for each language. You can even be clever and generate an id here.
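
As a very rough sketch of the "give each doc comment an id and autogenerate an FTL file" step: the id scheme below (item path with `::` replaced by `-`) and the function names are purely illustrative assumptions, not an existing rustdoc feature. It also ignores FTL escaping for lines that begin with special characters such as `[`, `*`, or `.`, and for braces in the text.

```rust
/// Derives a message id from an item path, e.g. `std::mem::swap` -> `std-mem-swap`.
pub fn doc_id(item_path: &str) -> String {
    item_path.replace("::", "-")
}

/// Emits one FTL message for a doc comment, using an indented multiline value.
pub fn doc_to_ftl(item_path: &str, doc: &str) -> String {
    let mut out = format!("{} =", doc_id(item_path));
    for line in doc.lines() {
        // Continuation lines of an FTL value must be indented.
        out.push_str("\n    ");
        out.push_str(line);
    }
    out.push('\n');
    out
}
```

Running this over every documented item in std would yield the English FTL file that translators then work from.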

Yes, that's the plan, it's just that I'd like to handle rustc first, as people have been asking for this for a while. (Cargo would be the next logical step, clippy and rustfmt are both optional)

This can be worked on. I don't consider this to be a blocker; it has been an issue anyway. Many of the parse errors are not usefully searchable.

Yes, this would be done

Other tools use LC_ALL/LANG but I think this is fine. What we could potentially do is (eventually) prompt as the first step during installation, and ask what language you want it to be in.

Having rustup manage everything sounds good.

Potentially, but this would make every error printing function into a macro, which isn't great. I'd rather do this as a separate crate.


Did we ever, as a community, seriously discuss the question of whether translated compiler output is valuable to non-English speakers, and whether or not they’d want it as the default?

I think it’s obvious that translating human-written prose such as The Book and standard library documentation is genuinely valuable (separate from the question of whether we have the resources to maintain high-quality translations of them), and obvious that “translating” language syntax would be not only unhelpful but actively counterproductive, but it’s not at all obvious to me whether translating compiler output would fall into the former or the latter category. Although I’m bilingual, my native language is English, so I have no relevant lived experience to report on this myself.

All of the past discussion I can find about translating compiler output here and on users mixes it up with some other question like translating language syntax or allowing non-ASCII identifiers. Certain posts in this thread imply that there might be a consensus that “opt-in” localized compiler output would be valuable, but that’s based on the posts of approximately 2 users. So I think we should explicitly and directly ask non-English Rust users how valuable they think translated compiler output specifically would be before we invest significant time into implementing it, especially if we haven’t localized most of our official documentation yet. Though it probably shouldn’t be as long and protracted a discussion as the one we had for non-ASCII identifiers.


If the goal is to make Rust more accessible to people who don’t speak English, then I think translation of libstd documentation, or even translation of the Rust Book would be more impactful.

Most compiler error messages are so technical that even understanding the meaning of the individual words is not enough to understand what the error is about, so the important part is having prose (documentation, forums) with fuller explanations in the user’s native language.


I looked into translation of compiler error messages last year, at a time when I was new to Rust. As of last December there were about 5700 distinct error messages that needed translation.


It has been asked for multiple times. This kind of thing is highly dependent on the language community; some communities prefer materials in English, others prefer localized things. The communities are not monoliths either so some people may want it and others may prefer English.

For a given language to get translations you'd need a group of people from that community who think it's important enough to do so for that language; which kind of answers the question of "do people want this" for that language.

As I mentioned, other programming tools (including gcc) do this already. We're not breaking new ground here.

There is at least one Rust book in a non-English language (Mandarin) already. Translating entire books is a much more involved proposition, I kinda prefer if people write new books from scratch for this.

Libstd could also be done, I'm focusing on things in the order of how much work they are to coordinate. We eventually want to be in a place where everything is translated, but I'd like to ratchet up slowly, rather than doing everything at once. If folks are interested in extending rustdoc to support this, I have ideas for how it should be designed so as to work with Pontoon.

We already have error codes, and the error index would also be translated.

You also have to bootstrap this stuff somewhere. There are Stack Overflows and Stack Overflow-like resources for other languages (The Rust topic on Zhihu seems rather active). As more people who are not comfortable with English start using Rust, these resources will grow too, especially if we provide them at least a partial path.


I would like to see syntax extensions be removed from the language one day to reduce language and compiler complexity so I think these two should be treated as being the same from a translation and diagnostics POV.


I suspect most will agree, but one thing I'd like to emphasize is that we should remain incremental about this since new error messages are added all the time and it would be unfortunate to block diagnostics improvements on having translations. This is of course far off, but a fallback approach to English if a translation doesn't exist is in my view serviceable.


I do agree with not basing the language selection on the OS default. Using rustup to set the choice as illustrated by @Manishearth seems quite good.


I just looked briefly at Fluent. It wasn’t obvious that it facilitates translation of format_args!(…) macro invocations, which is what will be required to localize rustc compiler messages, since many languages will require reordering, and perhaps even repetition, of some of those arguments.


Right, this is indeed the plan. If a translation cannot be resolved, it will fall back to English. There's a bunch of best practices around doing all of this that we intend to follow. The website does something similar -- website developers are not required to translate everything, they are just required to create translation strings (i.e. create an id and stick it in the FTL file), and the translators will automatically see that there's a new string to be dealt with on Pontoon.
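
The fallback behaviour can be sketched in a few lines. This is an illustration only: `Messages` and `lookup` are hypothetical names, and returning the bare id when even the English string is missing is an assumption made here to keep the function total.

```rust
use std::collections::HashMap;

pub type Messages = HashMap<&'static str, &'static str>;

/// Looks up `id` in the requested language, then in English,
/// and finally falls back to the id itself (an assumption of this sketch).
pub fn lookup<'a>(requested: &'a Messages, english: &'a Messages, id: &'a str) -> &'a str {
    requested
        .get(id)
        .or_else(|| english.get(id))
        .copied()
        .unwrap_or(id)
}
```

A newly added diagnostic therefore shows up in English for every language until its translation lands, which is what keeps diagnostics work from being blocked on translators.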

Sure it does, it supports variables, including pluralization. We would have a new macro -- not format_args! -- but this is all supported. We use variables on the website already.

Fluent has been designed by internationalization experts. I've got a decent level of understanding of how i18n works and think it supports everything we need (except number formatting in the Rust version, which is planned; we can do without it for now).
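
To illustrate why named variables address the reordering and repetition concern: with positional `format_args!`-style arguments the order is fixed in the source code, whereas named placeables let each translation put the arguments wherever its grammar needs them. The `substitute` function below is a simplified stand-in written for this illustration and is not how Fluent resolves patterns internally.

```rust
use std::collections::HashMap;

/// Replaces `{ $name }` placeables with values from `args`.
/// Unknown names and unterminated placeables are left untouched.
pub fn substitute(pattern: &str, args: &HashMap<&str, String>) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("{ $") {
        // Copy the literal text before the placeable.
        out.push_str(&rest[..start]);
        let after = &rest[start + 3..];
        match after.find(" }") {
            Some(end) => {
                let name = &after[..end];
                match args.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        out.push_str("{ $");
                        out.push_str(name);
                        out.push_str(" }");
                    }
                }
                rest = &after[end + 2..];
            }
            None => {
                // No closing ` }`: emit the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

The same argument map can feed a pattern that lists `expected` before `found`, one that lists them the other way round, or one that mentions `expected` twice, without any change to the calling code.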


IMO it would be premature to start translating now simply because the current diagnostic handling is not amenable to it. I also feel that focusing on localizing the documentation at this time might be a more rewarding effort.

Regarding parse errors not having codes, it is tricky because the benefit of the codes is being able to provide a description of the problem and some help. The parser can potentially find any kind of error, so it is impossible to have meaningful code descriptions for the general cases. That being said, every error that the parser is able to recover from should have an error code.


That explains the expected approach. Thanks. FYI, there are over 10k format strings in rustc which are likely to require conversion to the new macro.

A quick grep shows fewer than 2.5k. A lot of those 10k are in tests, tools, and the stdlib. There will of course be other strings which don't go through format. The migration can be done incrementally.

The website has around 500 strings, and half of these are prose, and we had our first full translation ready in about a week.

This work is quite parallelizable.

What do you mean? To be clear, I'm suggesting we start by adding the framework, and once we've got everything converted to translation strings, we then solicit translation teams. The website did the same thing: some others and I converted it to strings first, and then we invited teams to translate them.


A year ago I grep'd all of the different macros that generate fmt::Arguments. Just searching for format_args! doesn't suffice.


I separate these two since syntax extensions are able to use rustc's unstable diagnostics API, while proc macros need to go through the library design for eventual stabilization. If you could remove syntax extensions before this topic is RFC-accepted, I agree we could treat them the same :stuck_out_tongue: . Meanwhile let me add a comment to issue #54140...