Pre-RFC : Localization for rustdoc

Updated according to feedback and new ideas :

Summary

This pre-RFC describes improvements to the rustdoc tool that let it generate documentation in multiple languages, warn about outdated translations, and help translators keep their work up to date.

Motivation

Having documentation in your native language is essential if you don’t speak English, and still enjoyable even if you do. But a common problem with translated documentation is that it may become outdated without notice if the translator does not review it at every release.

A huge part of the documentation of Rust projects (including the standard library) lives inside the source code as documentation comments, and is rendered by the rustdoc tool. This tool could be improved to handle translations and to warn both users and translators when part of the original documentation has been modified after it was translated.

The main objectives of the suggested rustdoc tool improvement are :

  • Introduce a standard format for localization of documentation comments that feels natural for Rust developers (since translators will mostly be Rust users too)
  • The default workflow must make the translation effort fully transparent for the developers :
    • Localization has no impact on the source code.
    • Localization work takes place in a separate directory that developers don't have to care about.
    • No additional step or special tooling is required to generate the localized documentation.
  • When the documentation is generated, warn on command line about outdated or missing translations
  • The translated documentation must contain a warning and a reference to the current original text on items with an outdated translation.
  • A translator must be able to easily spot all the outdated translations.

Guide level explanation

Translate crate documentation for a new language

If you want to provide a translation for a new language, run the command cargo doc --l10n-generate LANG where LANG is the code of the language. A localization directory localization/LANG for the specified language is automatically generated at the root of the crate directory. It will mirror the structure of the source directory, but ".rs" files are replaced with ".loc" files.

For example, the crate directory of a library localized in French and Spanish, with a single module named “my_mod”, should look like this :

+-src
| +-my_mod
| | +-mod.rs
| +-lib.rs
+-localization
  +-es_ES
  | +-src
  |   +-my_mod
  |   | +-mod.loc
  |   +-lib.loc
  +-fr_FR
    +-src
      +-my_mod
      | +-mod.loc
      +-lib.loc
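
The mapping from a source file to its localization file can be sketched as follows. This is only an illustration of the layout above: the helper name loc_path and its signature are hypothetical and not part of the proposal, and the source path is assumed to be relative to the crate root.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of the proposal): maps a source file,
/// given relative to the crate root, to its localization file for `lang`.
fn loc_path(crate_root: &Path, lang: &str, source: &Path) -> PathBuf {
    crate_root
        .join("localization") // localization directory at the crate root
        .join(lang)           // one sub-directory per language
        .join(source)         // mirror the source tree (src/...)
        .with_extension("loc") // ".rs" becomes ".loc"
}
```

With this sketch, src/my_mod/mod.rs and the language fr_FR map to localization/fr_FR/src/my_mod/mod.loc under the crate root.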

The .loc files contain the declarations of the items documented in the matching .rs file. On every item, the content of the documentation has been moved into a #[doc_original = r"..."] attribute, followed by an empty #[doc_translation = r""] attribute.

For example given this lib.rs file :

/// The main struct of the library
struct MainStruct {
  /// The only field of MainStruct
  field : u32,
}

impl MainStruct {
  /// Do something interesting
  fn do_something(&mut self) {
    self.field += 1;
  }
  fn undocumented_fn(&self){
    println!("Hello World !"); 
  }
}

The generated “lib.loc” would look like this :

#![translator=""]

#[doc_original=r"The main struct of the library"]
#[doc_translation=r""]
struct MainStruct {
  #[doc_original=r"The only field of MainStruct"]
  #[doc_translation=r""]
  field : u32,
}

impl MainStruct {
  #[doc_original=r"Do something interesting"]
  #[doc_translation=r""]
  fn do_something(&mut self) {}
}

Complete the #[doc_translation] attributes with the translation for your language and the #[translator] attribute with your name and address (if you want to).

Generating translated documentation

When you run the cargo doc command, the documentation for the available locales will be generated alongside the original one. If you want to generate the documentation for only one language, use the --language LANG parameter.

To handle incomplete or outdated translations:

  • If an item documented in the source does not have a matching item in the localization files, you will get a warning on the command line and the original text will be used for that item in the generated documentation.
  • If an item in the localization files has a #[doc_original] attribute that no longer matches the documentation in the source, you will get a warning on the command line. In the generated documentation, the description of that item will carry a warning header with a link to the documentation in the original language.

Fix an outdated translation

If the source code has changed and you want to bring the translation back in sync, run the command cargo doc --l10n-generate LANG where LANG is the code of the language. The content of the localization files for the specified language is automatically updated to match the new source, and you will be warned on the command line about the parts of the localization files that need to be updated:

  • Declarations of new items are inserted into the ".loc" files with a #[doc_original = r"..."] attribute containing the original documentation for the item and an empty #[doc_translation = r""] attribute.

  • For items whose documentation comment has been modified since the last translation, there will be a #[doc_outdated] attribute containing the documentation comment at the time of the previous translation, while #[doc_original] contains the new original documentation comment. The #[doc_translation] attribute is left unchanged.

For example if the previous lib.rs file is modified to :

/// The main struct of the library
struct MainStruct {
  /// The first field of MainStruct
  field : u32,
  /// An additional field
  additional_field : u32,
}

impl MainStruct {
  /// Do something interesting
  fn do_something(&mut self) {
    self.field += 1;
  }
  /// Do something else interesting
  fn do_something_else(&mut self){
    self.additional_field += 1;
  }
}

The generated “lib.loc” would look like this just after the automatic generation :

#![translator="John Doe<john.doe@domain.net>"]

#[doc_original=r"The main struct of the library"]
#[doc_translation=r"La struct principale de la bibliothèque"]
struct MainStruct {
  #[doc_original=r"The first field of MainStruct"]
  #[doc_outdated=r"The only field of MainStruct"]
  #[doc_translation=r"Le seul champ de MainStruct"]
  field : u32,
  #[doc_original=r"An additional field"]
  #[doc_translation=r""]
  additional_field : u32,
}

impl MainStruct {
  #[doc_original=r"Do something interesting"]
  #[doc_translation=r"Fait quelquechose d'interessant"]
  fn do_something(&mut self) {}
  #[doc_original=r"Do something else interesting"]
  #[doc_translation=r""]
  fn do_something_else(&mut self) {}
}

Complete the empty #[doc_translation] attributes with the translation. Update the #[doc_translation] on items that have a #[doc_outdated] attribute, then remove the #[doc_outdated] attribute.

Detailed design

Localization directory

Everything about localization will be in a directory passed to rustdoc via the --l10n-path DIR parameter. By default, the cargo doc command will pass the localization directory at the root of the crate directory, if it exists. This directory will contain a sub-directory for each language. Each of these sub-directories mirrors the source directory, with ".loc" files instead of ".rs" files.

Localization files

Syntax

The content of a “.loc” file is the same as that of the matching “.rs” file, except:

  • Only documented item declarations are present
  • The body of the items is ignored and should be empty, unless it contains documented items.
  • Items carry no documentation comments; the following attributes are used instead:
    • #[doc_translation] contains the translation of the item documentation
    • #[doc_original] contains an exact copy of the item documentation from the source. It will be automatically generated.
    • #[doc_outdated] contains the item documentation from the source, at the time of the translation. It will be automatically created if the item documentation from the source has been modified (it is unchanged if already present).
  • The crate can have a #[translator] attribute, listing information about the translators.

Automatic generation

Creating all the localization files by hand, with every documented item and every #[doc_original] attribute matching the documentation comment from the source exactly, would be too complex. Fortunately, the rustdoc tool will be able to generate all these files and help keep them up to date.

When you pass the --l10n-generate LANG parameter to rustdoc, it will generate (or update) the localization files for the specified language:

  • If the language sub-directory does not exist yet in the localization directory, it is created
  • For each file containing documented items in the source code, a “.loc” file is created if it does not already exist.
  • For each item documented in the source code, rustdoc will check the matching item in the module localization file:
    • If the item does not exist in the localization file:
      • Display a warning on the command line: <file>:<line> The item <item> needs a translation
      • The item is created in the localization file with a #[doc_original] attribute containing a copy of the item documentation from the source and an empty #[doc_translation] attribute.
    • If the item exists and the #[doc_original] is different from the documentation in the source:
      • Display a warning on the command line: <file>:<line> Translation for <item> needs to be updated.
      • The #[doc_original] attribute is updated to contain the new value from the source.
      • If the #[doc_outdated] attribute does not exist yet, it is created to contain the previous #[doc_original].
      • The #[doc_translation] attribute is unchanged.
    • If the item exists and the #[doc_original] contains the same text as the documentation in the source:
      • If the #[doc_translation] is empty, display a warning on the command line: <file>:<line> The item <item> needs a translation
      • If there is a #[doc_outdated] attribute, display a warning on the command line: <file>:<line> Translation for <item> needs to be updated.
      • Else do nothing.
  • The generated #[doc_original] and #[doc_outdated] attributes use raw string literals with the minimum required number of #. The #[doc_translation] will have the same number of # in its raw string delimiters as the #[doc_original].
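
The "minimum required number of #" in the last point can be computed directly from the text. The sketch below is an illustration only: min_raw_hashes and raw_literal are hypothetical names, not existing rustdoc functions. It relies on the rule that a raw string opened with n hashes ends at the first quote followed by n hashes, so n must exceed the longest run of # that follows a quote inside the text.

```rust
/// Hypothetical helper: smallest number of `#` needed so that `text`
/// can be written as a Rust raw string literal.
fn min_raw_hashes(text: &str) -> usize {
    let mut needed: usize = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            // Count the `#` characters directly following this quote.
            let mut run: usize = 0;
            while chars.peek() == Some(&'#') {
                chars.next();
                run += 1;
            }
            // The delimiter needs one more `#` than this run.
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Hypothetical helper: writes `text` as a raw string literal
/// delimited with `hashes` `#` characters.
fn raw_literal(text: &str, hashes: usize) -> String {
    let h = "#".repeat(hashes);
    format!("r{0}\"{1}\"{0}", h, text)
}
```

A text without quotes needs no #, while a text containing a plain quote needs one. A generator following the rule above would pass the count computed for #[doc_original] when writing the empty #[doc_translation] as well.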

The translator will have to complete the empty #[doc_translation] attributes and update the ones that have a #[doc_outdated]. When they have finished updating a translation, they will delete its #[doc_outdated]. To be sure they have not forgotten to translate an item or to remove a #[doc_outdated], they can run the generator again and fix the items until there are no more warnings.
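
The per-item rules of the generator can be sketched as a small function. This is a minimal model under my own assumptions, not rustdoc code: the LocEntry and Warning types and the update_entry name are hypothetical, and an item is reduced to the three attribute values.

```rust
/// Hypothetical model of one item in a ".loc" file.
#[derive(Debug, Clone, PartialEq)]
struct LocEntry {
    original: String,         // #[doc_original]
    outdated: Option<String>, // #[doc_outdated], if present
    translation: String,      // #[doc_translation], may be empty
}

/// Hypothetical warning kinds, matching the two command line messages.
#[derive(Debug, Clone, PartialEq)]
enum Warning {
    NeedsTranslation, // "The item <item> needs a translation"
    NeedsUpdate,      // "Translation for <item> needs to be updated."
}

/// Applies the generator rules to one item and returns the updated
/// entry together with the warnings to display.
fn update_entry(existing: Option<LocEntry>, source_doc: &str) -> (LocEntry, Vec<Warning>) {
    let mut entry = match existing {
        // New item: copy the source documentation, leave the translation empty.
        None => {
            let created = LocEntry {
                original: source_doc.to_string(),
                outdated: None,
                translation: String::new(),
            };
            return (created, vec![Warning::NeedsTranslation]);
        }
        Some(entry) => entry,
    };

    if entry.original != source_doc {
        // Source changed: refresh the original and keep the previous text
        // in `outdated`, unless an older one is already recorded.
        let previous = std::mem::replace(&mut entry.original, source_doc.to_string());
        if entry.outdated.is_none() {
            entry.outdated = Some(previous);
        }
        return (entry, vec![Warning::NeedsUpdate]);
    }

    // Source unchanged: only report what the translator still has to do.
    let mut warnings = Vec::new();
    if entry.translation.is_empty() {
        warnings.push(Warning::NeedsTranslation);
    }
    if entry.outdated.is_some() {
        warnings.push(Warning::NeedsUpdate);
    }
    (entry, warnings)
}
```

Running the function again on its own output changes nothing and repeats the warnings until the translator fills in the translation and removes the outdated text, which is the "run the generator again until there are no more warnings" loop described above.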

Localized documentation generation

When a localization directory is specified, rustdoc will generate, by default, the documentation for the main language and for all the available languages. The --language LANG parameter restricts generation to the specified language.

For every localized documentation to generate, rustdoc loads the source code and the localization files. For every documented item in the source, it compares the documentation with the #[doc_original] attribute in the localization file:

  • If they match, then the value of #[doc_translation] is used for the translated documentation
  • If they don’t match, or if there is a #[doc_outdated] attribute :
    • The translated documentation of the item will contain an alert with a link to the main language documentation for the item
    • The value of #[doc_translation] is used in the translated documentation (after the warning)
  • If the item does not exist in the localization file or the #[doc_translation] is empty :
    • The documentation comment from the source is used

If a translation has outdated or missing items, there will be a warning: The translation for <LANG> seems outdated. It is followed by you should contact <translator> when the #[translator] attribute is filled in.

Drawbacks

  • Adds a lot of complexity to rustdoc.
  • The #[doc_original] attribute makes the localization files look verbose.
  • The attribute syntax is not as idiomatic as documentation comments.

Alternatives

Use a doc comment syntax

Even though doc comments are converted to #[doc] attributes internally, documentation in source files is usually written as comments. A syntax based on doc comments may feel more natural. The attributes would have to be replaced with some kind of tag. For instance:

///[l10n]: # (original)
/// Original documentation
///[l10n]: # (translation)
/// Translated documentation
fn do_something(&mut self) { }

Use a hash

For #[doc_original] and #[doc_outdated], we could store a hash instead of the full text. Since the translation would then be the only full text, the doc comment alternative would not need a tag for it. For instance:

///[l10n]: #original (8a5858a)
/// Translated documentation
fn do_something(&mut self) { }

It would make localization files less verbose, but the translator would lose the ability to see the original text directly in the localization file.
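
A short hash like the one in the example could be derived from the original text as sketched below. The proposal does not name a hash function; 32-bit FNV-1a is used here only because it is small enough to write inline, and short_hash is a hypothetical name. A real implementation would probably want a hash with better collision resistance.

```rust
/// 32-bit FNV-1a, written inline to keep the sketch dependency-free.
/// The choice of hash is an assumption, not part of the proposal.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Hypothetical helper: 7-character hexadecimal tag for a doc comment,
/// in the style of the `8a5858a` placeholder above.
fn short_hash(doc: &str) -> String {
    let full = format!("{:08x}", fnv1a_32(doc.as_bytes()));
    full[..7].to_string()
}
```

The generator would compare the stored tag with the tag of the current documentation comment and flag the item as outdated when they differ.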

Use a diff in #[doc_outdated]

When the original documentation is modified, #[doc_outdated] may contain a diff between the previous original and the current one, instead of the full text. It may make the changes easier to spot in long comments, but it would introduce even more complexity into rustdoc.

Extension of localization files

At first I decided to go for the .loc extension for localization files, but since they are syntactically valid Rust files, maybe they should have the .rs or .loc.rs extension so they can be handled like Rust files by text editors.

Rely on existing translation tools

There are existing formats for localization files, like gettext or fluent. Rustdoc could generate gettext or fluent files instead of the proposed format.

But one of the most interesting points of these formats is the handling of dynamic text (plural, gender, ...). Since doc comments are static text, using fluent or gettext would not be that useful. Moreover, most documentation translators will be Rust developers who are not used to translation tools. They will probably feel more comfortable with a format that looks like a source file. This format would probably be easier for rustdoc to parse too.

Unresolved questions

markdown files

Since markdown files are not a collection of items but a whole file, it would require a different mechanism. It may be handled by paragraph.

macros

Macros can generate items with documentation. But it would probably be too complex to generate ".loc" files with macros.

Generated ".loc" files should be based on ".rs" files with expanded macros. If the translator wants to use macros in the ".loc" file too, they would have to write them manually.


Making editing the .loc files reasonably convenient is not a goal? For one, I'm sure everyone would hate having to deal with #[doc_translation=""].

Is it the logical module structure or 1:1 dir/foo.rs:dir/foo.loc correlation?

And also delete the old #[doc_main]?

You spell L10n incorrectly :wink:


Here's a slightly different idea: instead of adding #[doc_new] update #[doc_main] to the current source text immediately and add #[doc_diff] (a diff between old and new #[doc_main]). The translator will then see the current version and the actual changes. After they're done translating, they remove the appropriate #[doc_diff]. The presence of #[doc_diff] causes the same warnings as your #[doc_new].

Since the empty attribute is generated by the tool, and Rust supports multi-line strings and even raw strings, I don't think dealing with attributes would be a pain. Maybe the localization file generator should use raw strings.

I agree that a comment-like syntax would be closer to what is really in source files. This is why I suggested it in the alternatives.

The advantage of the attribute syntax is that it already exists in Rust. The localization files use Rust syntax, so rustdoc should be able to handle them out of the box. I have not looked deeply into the code yet, but handling a comment-like syntax would probably require modifying libsyntax. I'm not sure it is a good idea to make libsyntax handle a syntax specific to localization files.

I think that 1:1 correlation would be better, but only keeping logical module structure might be easier to implement. I'm not sure 1:1 correlation is that important.

I forgot to add it, but I thought about adding a #[doc_diff] (just as a tip). I wanted to keep it in the alternatives since it would need to add a "diff" capability to rustdoc.

But using it instead of #[doc_new] seems a good idea. Rustdoc should still consider the item outdated until the #[doc_diff] is removed. And the #[doc_new] should still be used for new items.

Fixed, thanks.

Alternate proposal

There is one tested and tried approach to translating documentation. A tool takes the documentation, extracts just the text, chops it up to paragraphs and puts it into a translation template, in PO or XLIFF format, for translators to translate. Then the tool takes the documentation again and replaces the strings with what it finds in the translation catalogue while preserving structure and most formatting.

Advantages

  • Tools already exist for working with common formats: po4a and translate toolkit. HTML, OpenDocument, even some Wiki formats are already supported. I have not seen markdown, but adding it, or other format, should not be hard.

  • There are many tools for editing such translations, both web (e.g. weblate, pootle, …) and applications (e.g. poedit, virtaal, lokalize, …). These tools are easy to use, making it easier for non-programmers and beginners to contribute, and translators are already experienced with them.

  • Changes are detected reliably. A changed paragraph comes up as untranslated. If the change is small, fuzzy matching is used to show the previous text and translation to the translator, so it can be easily updated.

  • When the documentation is restructured, translations will be automatically reused for any paragraph that moved around, but didn’t change (with fuzzy matching when it only changed a little).

  • The approach has been used for a long time by the KDE project for documentation and by the Debian project for package descriptions and configuration questions (debconf templates), and it works well for both projects.

  • Almost no complexity would have to be added to rustdoc itself. I suspect just calling translate toolkit’s html2po and po2html on the generated HTML would make a usable first prototype. Doing it on some intermediate format, or even on the sources themselves, would probably make more sense though, especially by not leaking HTML markup to the translator (comes up in local emphasis and verbatim quotes and such).

Non-problems

It would seem that the translator will lack context, but:

  • The translation units are generated in document order and all the editors respect the order of units, so the translator will still read the text in the document order.

  • The context is not that much needed in practice. Usually all the translator needs is a glossary of technical terms, so he translates them consistently; and the PO and XLIFF editors already have support for that.

  • Professional translators often split even down to sentences for more reuse. Splitting to paragraphs is rather conservative. See the FAQ in po4a manpage.

Drawbacks

  • A simple implementation would not be written in Rust, but in either Python (if using translate toolkit) or Perl (if using po4a).

Technical notes

When I’ve tried on HTML some time ago, I had better results with translate toolkit’s html2po and po2html than with po4a. I think translate toolkit would also be easier to integrate, since it is just simple scripts to extract and replace the strings, while po4a handles finding files and checking for changes and such, which requires extra configuration and rustdoc will probably want to take care of that anyway.


There are more drawbacks :

  • Need for manual action to rebuild the localized documentation
  • Need for translation specific tooling to regenerate documentation

With the system I suggest, the documentation for locales is rebuilt automatically when you build the master documentation, without requiring any extra action or tools. I'm OK with requiring tools for translators, but if maintaining the documentation needs extra care from the developers, it will sooner or later go out of date without notice.

I agree I did not research existing translation tools enough before writing this pre-RFC. For translators, being able to use existing translation tools might be useful. I read the documentation about ".po" files and it seems that supporting translation tools would not change much of what I am proposing. You need a tool to generate them from data extracted from the master, and since the purpose of rustdoc is to extract the documentation from the source, it is the best tool to do that. It would just need to generate ".po" files instead of the ".loc" format I proposed. The example file in my proposal would turn into:

msgid ""
msgstr ""
"Project-Id-Version: crate_name\n"
"PO-Revision-Date: 2016-03-04 15:13+0200\n"
"Last-Translator: John Doe <john.doe@domain.net>\n"

msgctxt "struct MainStruct"
msgid "The main struct of the library, it is very important"
msgstr "La struct principale de la bibliothèque, Elle est très importante"

msgctxt "impl MainStruct > fn do_something"
msgid "Do something interesting"
msgstr "Fait quelquechose d'intéressant"

And on updated texts, we would get:

#, fuzzy
#| msgid "Old main language text"
msgctxt "impl MainStruct > fn do_something"
msgid "New language text"
msgstr "Translation"
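
Emitting such entries from rustdoc could look like the sketch below. The function names and the way the context string is built are hypothetical; the msgctxt, msgid and msgstr keywords, the fuzzy flag and the "#| msgid" previous-text comment are the standard gettext PO conventions. Long strings are written on a single line here, without the line wrapping that gettext tools usually apply.

```rust
/// Escapes a string for use inside a double-quoted PO string.
fn po_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\t' => out.push_str("\\t"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical helper: formats one PO entry. When `previous` is set,
/// the entry is marked fuzzy and records the previous original text.
fn po_entry(context: &str, original: &str, translation: &str, previous: Option<&str>) -> String {
    let mut out = String::new();
    if let Some(prev) = previous {
        out.push_str("#, fuzzy\n");
        out.push_str(&format!("#| msgid \"{}\"\n", po_escape(prev)));
    }
    out.push_str(&format!("msgctxt \"{}\"\n", po_escape(context)));
    out.push_str(&format!("msgid \"{}\"\n", po_escape(original)));
    out.push_str(&format!("msgstr \"{}\"\n", po_escape(translation)));
    out
}
```

Passing the old text as previous plays the role of #[doc_outdated] in the ".loc" format: the entry stays flagged until the translator reviews it.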

I agree it should eventually be integrated into rustdoc to automate it for the authors.

How much it implements directly or calls out to translate toolkit is then an implementation detail. Given that I didn’t find a markdown splitter in either and that rustdoc already parses markdown, using the translate toolkit probably won’t save all that much work after all.

What I think should be taken from the existing tools is the idea of splitting to paragraphs or blocks. It can be applied to both long documentation blocks and separate markdown chapters (that answers what to do with markdown files). Every block that gets a block-level markup in the output should be a separate unit, so paragraph, heading, list item and block-quote should all be separate units. They should be put into the po file without the block-level format (so that the tool ensures heading is translated to heading, list item to list item etc.) and whitespace-normalized, so that reformatting that does not affect the output won’t break the translation.
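
As a rough illustration of the splitting and whitespace normalization, the sketch below cuts a documentation text into units at blank lines and collapses all whitespace inside each unit. It is deliberately simplified and the name split_units is hypothetical: it does not recognize headings, list items or block quotes as separate units, and does not strip block-level markup, which a real implementation would get from the markdown parser rustdoc already uses.

```rust
/// Hypothetical helper: splits a documentation text into translation
/// units at blank lines and normalizes whitespace inside each unit.
fn split_units(doc: &str) -> Vec<String> {
    let mut units = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in doc.lines() {
        if line.trim().is_empty() {
            // A blank line closes the current unit, if any.
            if !current.is_empty() {
                units.push(current.join(" "));
                current.clear();
            }
        } else {
            // Collect words so that line breaks and repeated spaces vanish.
            current.extend(line.split_whitespace());
        }
    }
    if !current.is_empty() {
        units.push(current.join(" "));
    }
    units
}
```

Because each unit is reduced to its words joined by single spaces, re-wrapping a paragraph in the source produces the same unit and does not invalidate its translation.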


My pre-proposal deliberately left out markdown files, but I agree that splitting into paragraphs seems the way to go.
