Translating the stdlib docs

See also: Translating the compiler

Organized translation efforts for Rust are now starting up. The website is in the process of being translated (Turkish and Simplified Chinese are ready).

This thread is in a similar vein to Translating the compiler: it is an attempt to figure out how best to structure translation efforts for the stdlib docs. The goal here is to settle on an implementable plan for producing and maintaining translated versions of the stdlib docs at https://doc.rust-lang.org/std/.


My plan here is similar to what I hope to do for the error index. These docs are all blocks of text; there’s no interpolation involved. With that in mind, we should actually be picking a simpler localization format. For example, Fluent is great, but it has its own syntax, and that syntax will conflict with Rust code (for example, Fluent will pick up anything between curly braces).

What’s important here is that the format – whatever we pick – is supported by Pontoon and won’t have any syntax that conflicts with Rust code. These files will never be manually written so it’s not that big a deal what the format precisely is. We may end up diffing things, though.

I’m leaning towards picking the JSON format used by WebExtensions, since the only syntax it supports is $foo$, which is unlikely to crop up. We can also create our own format and make it work with Pontoon.
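
To make the syntax conflict concrete, here is a minimal sketch that scans a doc block for the two placeholder syntaxes mentioned above. The helper names `brace_spans` and `dollar_spans` are made up for illustration, the brace scan ignores nesting, and neither function is a real Fluent or WebExtensions parser.

```rust
/// Returns every `{...}` span in `text`, i.e. the parts a curly-brace-based
/// format would try to interpret. Nesting is ignored (sketch only).
fn brace_spans(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut start = None;
    for (i, c) in text.char_indices() {
        match c {
            '{' if start.is_none() => start = Some(i),
            '}' => {
                if let Some(s) = start.take() {
                    spans.push(&text[s..=i]);
                }
            }
            _ => {}
        }
    }
    spans
}

/// Returns every `$name$` span: a `$`, an ASCII identifier, and a closing `$`.
fn dollar_spans(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut spans = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'$' {
            // Scan the identifier that follows the opening `$`.
            let mut j = i + 1;
            while j < bytes.len() && (bytes[j].is_ascii_alphanumeric() || bytes[j] == b'_') {
                j += 1;
            }
            // Only count it if the identifier is non-empty and closed by `$`.
            if j > i + 1 && j < bytes.len() && bytes[j] == b'$' {
                spans.push(&text[i..=j]);
                i = j + 1;
                continue;
            }
        }
        i += 1;
    }
    spans
}
```

Ordinary Rust snippets such as format strings and struct literals produce brace spans, while even macro definitions like `($x:expr) => { $x }` produce no `$name$` spans, which is the reason the `$foo$` syntax looks less risky.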

I currently think we should design translation support as follows:

  • Teach rustdoc to dump every rendered doc block from a given crate into a JSON file, potentially split up by module. We’d need some way of converting a given path into a valid JSON identifier.
  • Teach rustdoc to consume a folder full of such files and replace all doc comments based on them.
  • Create a separate repo for translations. Periodically update the en-US folder in that from the JSON dumped from a rustdoc run on master. It needs to be a separate repo since Pontoon will be committing directly to it.
  • Hook this repo up to Rust’s Pontoon
  • Perhaps make this repo a periodically-bumped submodule in rustc so that stable/beta docs can still be built.
  • Every official doc build, build the localized docs for each supported language by pulling in this repo. It may be worth making this a nightly-only task triggered after publication of the main nightlies
  • Hook this up into whatever distribution system we use for compiler language packs so that rustup component add rust-docs can work with this.
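
The first two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the `__` separator, the function names, and the use of in-memory maps instead of real JSON files are all hypothetical, and nothing here is existing rustdoc behaviour.

```rust
use std::collections::HashMap;

/// Turns an item path such as `std::vec::Vec::push` into a flat message key.
/// The `__` separator is an assumption, not an agreed convention.
fn path_to_key(path: &str) -> String {
    path.split("::")
        .map(|segment| {
            segment
                .chars()
                // Replace anything that is not identifier-like with `_`.
                .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
                .collect::<String>()
        })
        .collect::<Vec<_>>()
        .join("__")
}

/// Picks the doc block for `path`: the translated text if one exists,
/// otherwise the English text, otherwise nothing.
fn localized_doc<'a>(
    path: &str,
    translated: &'a HashMap<String, String>,
    english: &'a HashMap<String, String>,
) -> Option<&'a str> {
    let key = path_to_key(path);
    translated
        .get(&key)
        .or_else(|| english.get(&key))
        .map(String::as_str)
}
```

Falling back to English for missing keys would let a partially translated locale still produce complete docs.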

Questions:

  • Do we have the bandwidth to implement this?
  • The relationship between the translated docs repo and rustc is somewhat cyclic, so submodules may not be the right answer here. Is there a better option?
  • How exactly should the deploy process work? (cc @pietroalbini)

There are also questions about how best to coordinate this work (there are some challenges the website doesn’t have), however those are common for all

cc @GuillaumeGomez @QuietMisdreavus @skade @sebasmagri

I have had some thoughts about leveraging something like doc(include) to create a system for translations. Point it at a base directory and provide a locale key to source docs from, and make rustdoc load those docs in, just like for doc(include). It shouldn’t be too intense to build a system like this into rustdoc, but implementing it in the standard library will require a lot of tedious legwork.
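
As a rough illustration of the lookup such a system would need, here is a sketch that maps a base directory, a locale key, and an item path to a Markdown file. The directory layout and the function name `localized_doc_path` are assumptions made for this example, not anything doc(include) or rustdoc defines.

```rust
use std::path::{Path, PathBuf};

/// Builds `<base>/<locale>/<path segments...>.md` for an item path.
/// The layout is hypothetical; it only shows the shape of the lookup.
fn localized_doc_path(base: &Path, locale: &str, item_path: &str) -> PathBuf {
    let mut path = base.join(locale);
    for segment in item_path.split("::") {
        path.push(segment);
    }
    path.set_extension("md");
    path
}
```

For example, `localized_doc_path(Path::new("docs"), "fr", "std::vec::Vec")` yields `docs/fr/std/vec/Vec.md`.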

One wrinkle is that the statement that “there’s no interpolation involved” isn’t 100% correct:

The docs for the integer/float primitives and the atomic/nonzero integers are interpolated by macro, to insert relevant values and type names into their doctests. Any extraction that occurs needs to be able to take this into account.
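
A simplified sketch of what macro-interpolated docs look like is below. It is not the actual stdlib macro: the names `max_wrapper`, `MaxU8`, and `MaxI32` are invented, and the `DOC` constant exists only so the generated text can be inspected.

```rust
macro_rules! max_wrapper {
    ($name:ident, $ty:ty) => {
        // The doc text is assembled at expansion time, so the source file
        // never contains the final string for any single type.
        #[doc = concat!(
            "A wrapper around `", stringify!($ty),
            "`. The largest value is `", stringify!($ty), "::MAX`."
        )]
        pub struct $name(pub $ty);

        impl $name {
            /// The same text as the generated doc comment, exposed for inspection.
            pub const DOC: &'static str = concat!(
                "A wrapper around `", stringify!($ty),
                "`. The largest value is `", stringify!($ty), "::MAX`."
            );
        }
    };
}

max_wrapper!(MaxU8, u8);
max_wrapper!(MaxI32, i32);
```

An extractor working on the source text would see only the template with `stringify!($ty)` in it, while one working on rustdoc's rendered output would see one expanded string per type.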

One goal behind this plan is to take this work off the plates of the people working on the stdlib: once the infrastructure is set up, stdlib maintainers don’t need to keep managing these files, because it’s all automated.

One advantage of the doc(include) approach is that we can then VCS things as Markdown files, which is nice for keeping track of things. We could autogenerate this stuff too. This is also just a lot of files, though: with the JSON option we can choose how granular our files are (I’m thinking at most one per type, probably one per module).

Yes, please. Translating the stdlib is definitely helpful. Furthermore, doing the legwork so that the documentation of all other crates can be gradually translated by the community is heroic. Can’t wait to see docs.rs displaying localized docs for every crate!

Yeah, submodules are currently the way to go for external resources used by rustc. Updating it like the other submodules shouldn’t be an issue.

It would be best to build the translated docs alongside the English ones on CI. Our current release tooling is basically “sign and copy the rust-lang-ci2 artifacts to static-rust-lang-org”, and implementing what you suggested would be a big infra change. We also can’t add or change stuff in a nightly after release, since that’d invalidate all signatures.

I don’t recommend including all the translations in the rust-docs component. Unpacking that component on Windows is already a pain due to the huge number of files and some Windows corner cases, and bundling translations would increase its size a lot. What we can do is produce a rustup component for each language (like rust-docs-it for Italian) and have rustup download the right one alongside English based on the language you configured.
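
A small sketch of that selection logic, assuming component names follow the `rust-docs-<locale>` pattern suggested above. The function `docs_components` is hypothetical and is not part of rustup.

```rust
/// Lists the docs components to download for a configured locale.
/// English is assumed to be covered by the base `rust-docs` component.
fn docs_components(configured_locale: Option<&str>) -> Vec<String> {
    let mut components = vec!["rust-docs".to_string()];
    if let Some(locale) = configured_locale {
        let is_english = locale == "en" || locale.starts_with("en-");
        if !is_english {
            components.push(format!("rust-docs-{}", locale));
        }
    }
    components
}
```

With Italian configured this gives `rust-docs` plus `rust-docs-it`; with no locale or an English one it gives only `rust-docs`.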

No, I mean that we’d make this part of rustup language install fr (or whatever distribution system we make for this): If you’ve installed a language, rust-docs gives you those docs for free.

My concern is that rustdoc runs aren’t that cheap. We can try, though.

Heh, I know, we already disabled docs for non-tier-1 platforms since they were expensive to build. I’m not sure we can get around that if we want to bring stdlib translations in though. Raising this in today’s infra meeting.

Also, what about code examples in the docs? If they’re included in the Pontoon strings we’ll want to test them as part of our CI to ensure they actually compile, and possibly set up toolstate to disable (maybe?) the docs when their compilation fails.
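
One cheap pre-check that could run before actually compiling anything is sketched below: it compares the fenced code blocks of the English and translated text and flags any difference. It assumes translators are not supposed to touch code at all, which is a policy choice and not something decided here, and the function names are invented for the example. It does not replace running the doctests.

```rust
/// Collects the contents of every fenced code block in a Markdown doc block.
fn code_blocks(markdown: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Option<String> = None;
    for line in markdown.lines() {
        if line.trim_start().starts_with("```") {
            match current.take() {
                // A fence while inside a block closes it.
                Some(block) => blocks.push(block),
                // A fence while outside a block opens a new one.
                None => current = Some(String::new()),
            }
        } else if let Some(block) = current.as_mut() {
            block.push_str(line);
            block.push('\n');
        }
    }
    blocks
}

/// True if the translation left every code example byte-for-byte intact.
fn examples_unchanged(english: &str, translated: &str) -> bool {
    code_blocks(english) == code_blocks(translated)
}
```

A translation that fails this check could be held back for review, while one that passes still needs the normal doctest run.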

We discussed this during today’s infra team meeting, and this is the consensus we reached (only for the infra side) on the proposal:

If we rearchitect parts of our CI to build all the docs only once, on a dedicated builder, we have enough capacity to build localized stdlib docs for multiple languages :tada:

The approach we recommend to include the translations in the build is to put the Pontoon repository as a submodule of rustc, and use toolstate to manage it the same way we currently do with Clippy, rustfmt and RLS. Depending on what we decide counts as a “green” build, submodule updates and breakages will be more or less frequent.

Except maybe for Fluent, the format and the syntax of interpolations are mostly unrelated. XLIFF is used with many different frameworks that each use different syntax inside the strings, and in PO the placeholders differ depending on the programming language even though it’s still gettext tooling.

There is Translate Toolkit (a Python library and command-line utility) for converting just about any format to any other format if needed.

This would probably be easier using some format that is English-to-translation rather than ID-based (the best known is PO; XLIFF can also be used that way). Debian uses po4a for automatically extracting texts for translation this way, and Translate Toolkit can do it for HTML and some other formats too (last I checked, Translate Toolkit worked better for HTML, and it is probably easier to extend to support a new format too).

This has two advantages:

  • No change to the documentation format is required, so no extra work for the library maintainers and almost no change to how things are done (there may be a couple of things to avoid, but not much).
  • The tooling makes sure that when the text changes, it will come up as needing (re-)translating, automatically, without risk of forgetting. The Gettext and Translate Toolkit tools support fuzzy matching, so a whole paragraph does not have to be translated again if it just changes a little, but somebody gets to see the translation before it is used.
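
The re-translation workflow in the second point can be illustrated with a toy classifier. The word-overlap score below is a crude stand-in chosen for brevity, not the matching algorithm that Gettext or Translate Toolkit use, and the names `Status`, `similarity`, and `classify` are invented for this sketch.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Status {
    /// The English source is unchanged since it was translated.
    UpToDate,
    /// The source changed a little; reuse the old translation but flag it for review.
    Fuzzy,
    /// No usable translation; translate from scratch.
    Untranslated,
}

/// Fraction of whitespace-separated words in `new` that also occur in `old`.
fn similarity(old: &str, new: &str) -> f64 {
    let old_words: HashSet<&str> = old.split_whitespace().collect();
    let new_words: Vec<&str> = new.split_whitespace().collect();
    if new_words.is_empty() {
        return 0.0;
    }
    let shared = new_words.iter().filter(|w| old_words.contains(*w)).count();
    shared as f64 / new_words.len() as f64
}

/// `translated_from` is the English text the existing translation was made from, if any.
fn classify(translated_from: Option<&str>, current: &str, threshold: f64) -> Status {
    match translated_from {
        Some(source) if source == current => Status::UpToDate,
        Some(source) if similarity(source, current) >= threshold => Status::Fuzzy,
        _ => Status::Untranslated,
    }
}
```

The key property is that a changed source never silently keeps its old translation: it is either flagged as fuzzy for a human to confirm or sent back for translation.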

Hm, Translate Toolkit can already take HTML apart by paragraph, put it up for translation, and assemble it again. Maybe with a bit of customization (to skip parts that are not for translation, like the symbol names and item signatures) it would be possible to simply translate the output of rustdoc, so it wouldn’t need any special support in rustdoc itself.

It would see the macro expansions mentioned above already expanded, but then those were in doctests and that is code that shouldn’t be translated anyway (translating comments in examples and doctests can be left for a later revision).


Side-note: Are the docs platform-independent in the face of #[cfg(whatever)] platform/target/system-specific symbols?

Overall my impression of the situation is that English-to-string translation systems aren’t as preferred these days, especially for larger systems. In this case, we’re translating prose, so it’s really not well suited.

“No change to the documentation format is required, so no extra work for the library maintainers and almost no change to how things are done (there may be a couple of things to avoid, but not much).”

The proposed system here also has this benefit. Changes to the pipeline and tooling are necessary, but that’s true in any system. The doc format stays the same.

“Hm, Translate Toolkit can already take HTML apart by paragraph, put it up for translation, and assemble it again.”

We use Markdown, though, and I’d rather not split up the generated HTML output; it’s nicer if the translators have the full ability to translate within Markdown.

“Side-note: Are the docs platform-independent in the face of #[cfg(whatever)] platform/target/system-specific symbols?”

There’s a special doc(cfg) thing that makes this work. It can be dealt with.
