Pre-RFC : Localization for rustdoc


#1

There has already been a pre-RFC on the subject, but I was not convinced by the idea of adding attributes on the source side, and it did not tackle what is, in my opinion, the main problem with translations: keeping them up to date. So I tried to sum up my ideas and wrote this. I’d like to get comments before making a real RFC.

Summary

[summary]: #summary

This RFC describes improvements to the rustdoc tool to make it able to generate documentation for multiple languages, warn about outdated translations and assist translators in keeping their work up to date.

Motivation

[motivation]: #motivation

Having documentation in your native language is essential if you don’t speak English, and still enjoyable if you do. A huge part of the Rust documentation lives inside the source code as documentation comments and is generated by the rustdoc tool. So this tool needs to be improved to handle multiple languages.

A common problem with translated documentation is that it is risky to use because it is often outdated. The rustdoc tool should prevent this by warning both the readers of the documentation and the translators if some part of the original documentation has been modified after the translation.

The objectives of the rustdoc tool improvement are:

  • Make the localization of documentation comments fully transparent:
    • to the developers: adding, removing or modifying a locale has no impact on the source code.
    • to other translators: adding, removing or modifying a locale has no impact on other locales.
  • Warn about outdated or missing translations when the documentation is generated.
  • The translated documentation must contain a warning on items with an outdated translation.
  • A translator (the previous one or a new one) must be able to easily spot the parts modified since the last time the translation was up to date.

Detailed design

[design]: #detailed-design

Localization directory

All the localization information will be in a directory passed to rustdoc via the --l10n-path DIR parameter. This directory will contain a subdirectory for each language. Each of these subdirectories will contain a localization file for each module with documented items; the structure is the same as in the source.

For example, the localization directory of a library, localized in French and Spanish, with a single module named “my_mod”, should look like this :

localization
+-fr_FR
| +-my_mod
| | +-mod.loc
| +-lib.loc
+-es_ES
  +-my_mod
  | +-mod.loc 
  +-lib.loc
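
As a rough sketch of the layout above, the path of a localization file could be computed from the language and the module path. The function name `loc_file_path` is hypothetical, and the sketch assumes every module maps to a `mod.loc` inside its own directory, as in the example tree; a module declared as `my_mod.rs` might need a different rule.

```rust
use std::path::{Path, PathBuf};

/// Builds the path of the `.loc` file for one module (hypothetical helper).
/// `module_path` lists the module names from the crate root down;
/// an empty slice stands for the crate root itself.
fn loc_file_path(l10n_root: &Path, lang: &str, module_path: &[&str]) -> PathBuf {
    let mut path = l10n_root.join(lang);
    if module_path.is_empty() {
        // The crate root is stored as `lib.loc`, as in the example tree.
        path.push("lib.loc");
    } else {
        // Assumption: each module gets its own directory containing `mod.loc`.
        for segment in module_path {
            path.push(segment);
        }
        path.push("mod.loc");
    }
    path
}
```

With the tree above, `loc_file_path(Path::new("localization"), "fr_FR", &["my_mod"])` would point at `localization/fr_FR/my_mod/mod.loc`.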

Localization files

syntax

The content of the “.loc” files is the same as that of the matching “.rs” file, except:

  • there are no documentation comments on items, but attributes instead:
    • #[doc_main] contains a copy of the item documentation from the source, as it was at the moment of the translation
    • #[doc_translation] contains the translation of the item documentation
    • #[doc_new] is generated if the #[doc_main] is outdated or missing (see below); it contains the current item documentation from the source.
  • the crate can have a #[translator] attribute, listing the names and emails of the translators.
  • undocumented items are not necessary
  • the bodies of the functions are empty.

For example a “lib.loc” might look like this :

#![translator="John Doe<john.doe@domain.net>"]

#[doc_main="The main struct of the library, it is very important"]
#[doc_translation="La struct principale de la bibliothèque, Elle est très importante"]
struct MainStruct {
}

impl MainStruct {
  #[doc_main="Do something interesting"]
  #[doc_translation="Fait quelquechose d'intéressant"]
  fn do_something(&self) {}
}

localization file generation

For big projects, it would be very complex to manually create all the localization files with all the documented items and with all the #[doc_main] attributes matching exactly the documentation comments from the source. To help the translator produce the localization files and keep them up to date, the rustdoc tool will be able to generate all these files.

When you pass the --l10n-generate LANG parameter to rustdoc, it will (re)generate localization files for the specified language:

  • If the language subdirectory does not exist yet in the localization directory, it is created.
  • For each module containing documented items in the source code, a “.loc” file is created if it does not exist already.
  • For each item documented in the source code, rustdoc will check the matching item in the module localization file:
    • If the item does not exist in the localization file:
      • Rustdoc displays a warning on the command line.
      • The item is created in the localization file with a #[doc_new] attribute containing a copy of the item documentation from the source and an empty #[doc_translation] attribute.
    • If the item exists and the #[doc_main] contains the same text as the documentation in the source, nothing is done.
    • If the item exists and the #[doc_main] is different from the documentation in the source:
      • Rustdoc displays a warning on the command line.
      • A #[doc_new] attribute containing a copy of the documentation from the source is added to the item in the localization file.
      • The #[doc_main] and #[doc_translation] attributes are unchanged.
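
The per-item rules above can be sketched as one function. This is only an illustration: `LocEntry`, `Status` and `regenerate_entry` are hypothetical names, and the attributes are modelled as plain optional strings rather than parsed from a real “.loc” file.

```rust
/// Documentation attributes stored for one item in a `.loc` file (hypothetical model).
#[derive(Debug, Clone, Default, PartialEq)]
struct LocEntry {
    doc_main: Option<String>,
    doc_translation: Option<String>,
    doc_new: Option<String>,
}

/// What the generator found for one item; anything but `UpToDate` would trigger a warning.
#[derive(Debug, PartialEq)]
enum Status {
    UpToDate,
    Missing,
    Outdated,
}

/// Applies the generation rules to one item, given its current documentation in the source.
fn regenerate_entry(existing: Option<LocEntry>, source_doc: &str) -> (LocEntry, Status) {
    match existing {
        // Item absent: create it with #[doc_new] and an empty #[doc_translation].
        None => (
            LocEntry {
                doc_main: None,
                doc_translation: Some(String::new()),
                doc_new: Some(source_doc.to_string()),
            },
            Status::Missing,
        ),
        // #[doc_main] matches the source: nothing to do.
        Some(entry) if entry.doc_main.as_deref() == Some(source_doc) => (entry, Status::UpToDate),
        // #[doc_main] differs: add #[doc_new], leave the other attributes untouched.
        Some(mut entry) => {
            entry.doc_new = Some(source_doc.to_string());
            (entry, Status::Outdated)
        }
    }
}
```

In this sketch an entry that was created as missing and not yet translated is reported as `Outdated` on the next run, because it still has no #[doc_main]; whether rustdoc should distinguish that case is left open.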

So the translator will have to update the #[doc_translation]. When finished, they can simply rename the #[doc_new] to #[doc_main] and delete the old #[doc_main]. To be sure nothing is left untranslated, they can run the generator again until there are no more warnings.

Localized documentation generation

When a localization directory is specified, rustdoc will generate, by default, the documentation for the main language and for all the available languages. The --language LANG parameter restricts generation to the specified language.

For every localized documentation to generate, rustdoc loads the source code and the localization file. For every documented item in the source, it compares the documentation in the source with the #[doc_main] attribute in the localization file:

  • If the item does not exist in the localization file:
    • There is a warning on the command line, suggesting contacting the translators specified in #[translator].
    • The documentation comment from the source is used.
  • If they match, the value of #[doc_translation] is used for the translated documentation.
  • If they don’t match:
    • There is a warning on the command line, suggesting contacting the translators specified in #[translator].
    • The translated documentation of the item will contain an alert with a link to the main language.
    • The value of #[doc_translation] is used in the translated documentation (after the alert).
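
The three cases above could look roughly like this for a single item. The names `Translated`, `Rendered` and `render_item` are hypothetical, and the warning messages are placeholders rather than proposed wording.

```rust
/// The two attributes read back from a `.loc` file for one item (hypothetical model).
struct Translated<'a> {
    doc_main: &'a str,
    doc_translation: &'a str,
}

/// Result of choosing the text of one item in a localized build.
#[derive(Debug, PartialEq)]
struct Rendered {
    /// Text shown in the translated documentation.
    text: String,
    /// True when the page should carry an alert linking to the main language.
    outdated_alert: bool,
    /// Message for the command line, if any.
    warning: Option<String>,
}

fn render_item(
    item: &str,
    source_doc: &str,
    translated: Option<&Translated<'_>>,
    translators: &str,
) -> Rendered {
    match translated {
        // No entry in the localization file: fall back to the source text.
        None => Rendered {
            text: source_doc.to_string(),
            outdated_alert: false,
            warning: Some(format!("{}: missing translation, contact {}", item, translators)),
        },
        // #[doc_main] matches the source: the translation is up to date.
        Some(t) if t.doc_main == source_doc => Rendered {
            text: t.doc_translation.to_string(),
            outdated_alert: false,
            warning: None,
        },
        // #[doc_main] differs: keep the translation but flag it as outdated.
        Some(t) => Rendered {
            text: t.doc_translation.to_string(),
            outdated_alert: true,
            warning: Some(format!("{}: outdated translation, contact {}", item, translators)),
        },
    }
}
```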

Drawbacks

[drawbacks]: #drawbacks

  • Adds complexity to rustdoc.
  • The #[doc_main] attribute may make the localization files look verbose.
  • The attribute syntax is not as straightforward to use as documentation comments (backslashes need to be escaped).

Alternatives

[alternatives]: #alternatives

  • Rely on a version control system to spot changes, but it probably can’t integrate so smoothly with the rustdoc tool, or it would make rustdoc dependent on a specific VCS.
  • Use a hash instead of the full text for the #[doc_main] and #[doc_new] attributes. It is less verbose, but the translator loses the ability to compare the attributes to spot the changes; having the original text just above while writing the translation can be useful too.
  • Use a more comment-like syntax instead of attributes, /**main* ... */ and /**translation* ... */ for instance.

Unresolved questions

[unresolved]: #unresolved-questions

What to do with markdown files.


#2

Making editing the .loc files reasonably convenient is not a goal? For one, I’m sure everyone would hate having to deal with #[doc_translation=""].

[quote=“Uther, post:1, topic:3190”] These directories will contain a localization file for each module with documented items, the structure is the same as in the source. [/quote]Is it the logical module structure or 1:1 dir/foo.rs:dir/foo.loc correlation?

[quote=“Uther, post:1, topic:3190”] So the translator will only have to update the #[doc_translation] and, when he has finished, rename the #[doc_new] to #[doc_main]. [/quote]And also delete the old #[doc_main]?

You spell L10n incorrectly :wink:


Here’s a slightly different idea: instead of adding #[doc_new], update #[doc_main] to the current source text immediately and add #[doc_diff] (a diff between the old and new #[doc_main]). The translator will then see the current version and the actual changes. After they’re done translating, they remove the appropriate #[doc_diff]. The presence of #[doc_diff] causes the same warnings as your #[doc_new].
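
To make the idea concrete, here is a minimal sketch. `DiffEntry`, `refresh` and `naive_diff` are hypothetical names, and the diff is deliberately naive (it only lists lines that disappeared or appeared, ignoring order and repeated lines); a real implementation would need a proper diff algorithm.

```rust
/// Naive line diff: lines only in `old` get a "-" prefix, lines only in `new` a "+" prefix.
fn naive_diff(old: &str, new: &str) -> String {
    let old_lines: Vec<&str> = old.lines().collect();
    let new_lines: Vec<&str> = new.lines().collect();
    let mut out = Vec::new();
    for line in &old_lines {
        if !new_lines.contains(line) {
            out.push(format!("-{}", line));
        }
    }
    for line in &new_lines {
        if !old_lines.contains(line) {
            out.push(format!("+{}", line));
        }
    }
    out.join("\n")
}

/// One item of the localization file under the #[doc_diff] variant (hypothetical model).
struct DiffEntry {
    doc_main: String,
    doc_translation: String,
    doc_diff: Option<String>,
}

/// Moves #[doc_main] to the current source text and records what changed.
fn refresh(entry: &mut DiffEntry, source_doc: &str) {
    if entry.doc_main != source_doc {
        entry.doc_diff = Some(naive_diff(&entry.doc_main, source_doc));
        entry.doc_main = source_doc.to_string();
        // #[doc_translation] is left as-is; the item counts as outdated
        // for as long as #[doc_diff] is present.
    }
}
```

One open point this sketch does not solve: if the source changes twice before the translator catches up, the second call overwrites the first diff, so the recorded changes are no longer relative to the text that was actually translated.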


#3

[quote=“gkoz, post:2, topic:3190”] Making editing the .loc files reasonably convenient is not a goal? For one, I’m sure everyone would hate having to deal with #[doc_translation=""]. [/quote]Since the empty attribute is generated by the tool and Rust supports multi-line strings and even raw strings, I don’t think dealing with attributes would be a pain. Maybe the localization file generator should use raw strings.

I agree that a comment-like syntax would be closer to what is really in source files. This is why I suggest it in the alternatives.

The advantage of the attribute syntax is that it already exists in Rust. The localization files use Rust syntax, so rustdoc should be able to handle them out of the box. I have not looked deeply into the code yet, but handling a comment-like syntax would probably require modifying libsyntax. I’m not sure it is a good idea to make libsyntax handle a syntax specific to localization files.
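
For illustration, a generator emitting raw strings could pick the number of `#` so that the text can never close the literal early. Both function names below are hypothetical; the sketch only relies on the fact that a raw string `r#"..."#` ends at a `"` followed by the same number of `#` it started with, and that backslashes need no escaping inside it.

```rust
/// Formats `text` as a Rust raw string literal with just enough `#` delimiters.
fn raw_string_literal(text: &str) -> String {
    let mut hashes = 0; // number of '#' the literal needs
    let mut run: Option<usize> = None; // '#' seen directly after the last '"'
    for c in text.chars() {
        run = match (c, run) {
            ('"', _) => Some(0),
            ('#', Some(n)) => Some(n + 1),
            _ => None,
        };
        // A '"' followed by n '#' in the text requires at least n + 1 '#' in the delimiter.
        if let Some(n) = run {
            hashes = hashes.max(n + 1);
        }
    }
    let fence = "#".repeat(hashes);
    format!("r{0}\"{1}\"{0}", fence, text)
}

/// Builds the attribute line the generator would write for a translation.
fn doc_translation_attr(text: &str) -> String {
    format!("#[doc_translation={}]", raw_string_literal(text))
}
```

A translation containing quotes, such as `say "hi"`, would then be written as `#[doc_translation=r#"say "hi""#]` with nothing to escape by hand.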

[quote=“gkoz, post:2, topic:3190”] Is it the logical module structure or 1:1 dir/foo.rs:dir/foo.loc correlation? [/quote]I think that 1:1 correlation would be better, but only keeping logical module structure might be easier to implement. I’m not sure 1:1 correlation is that important.

[quote=“gkoz, post:2, topic:3190”] Here’s a slightly different idea: instead of adding #[doc_new] update #[doc_main] to the current source text immediately and add #[doc_diff] (a diff between old and new #[doc_main]) [/quote]I forgot to add it, but I thought about adding a #[doc_diff] (just as a tip). I wanted to keep it in the alternatives since it would require adding a “diff” capability to rustdoc.

But using it instead of #[doc_new] seems a good idea. Rustdoc should still consider the item outdated until the #[doc_diff] is removed. And the #[doc_new] should still be used for new items.

[quote=“gkoz, post:2, topic:3190”] And also delete the old #[doc_main]?

You spell L10n incorrectly :wink: [/quote]Fixed, thanks.


#4

Alternate proposal

There is one tested and tried approach to translating documentation. A tool takes the documentation, extracts just the text, chops it up to paragraphs and puts it into a translation template, in PO or XLIFF format, for translators to translate. Then the tool takes the documentation again and replaces the strings with what it finds in the translation catalogue while preserving structure and most formatting.
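
A minimal sketch of that extract-and-replace cycle, assuming paragraphs are separated by blank lines and using a plain `HashMap` in place of a real PO or XLIFF catalogue (the function names are hypothetical):

```rust
use std::collections::HashMap;

/// Splits a documentation text into paragraphs separated by blank lines.
fn paragraphs(doc: &str) -> Vec<&str> {
    doc.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

/// Rebuilds the document, replacing each paragraph found in the catalogue
/// and keeping the original text for paragraphs without a translation.
fn translate(doc: &str, catalogue: &HashMap<&str, &str>) -> String {
    paragraphs(doc)
        .into_iter()
        .map(|p| catalogue.get(p).copied().unwrap_or(p))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Because the lookup is keyed on the paragraph text itself, a paragraph that moves elsewhere in the document keeps its translation, while a modified paragraph simply falls back to the original until it is translated again.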

Advantages

  • Tools already exist for working with common formats: po4a and translate toolkit. HTML, OpenDocument, even some Wiki formats are already supported. I have not seen markdown, but adding it, or another format, should not be hard.

  • There are many tools for editing such translations, both web (e.g. weblate, pootle, …) and applications (e.g. poedit, virtaal, lokalize, …). These tools are easy to use, making it easier for non-programmers and beginners to contribute, and translators are already experienced with them.

  • Changes are detected reliably. A changed paragraph comes up as untranslated. If the change is small, fuzzy matching is used to show the previous text and translation to the translator, so it can be easily updated.

  • When the documentation is restructured, translations will be automatically reused for any paragraph that moved around, but didn’t change (with fuzzy matching when it only changed a little).

  • The approach has long been used by the KDE project for documentation and by the Debian project for package descriptions and configuration questions (debconf templates), and it works well for both projects.

  • Almost no complexity would have to be added to rustdoc itself. I suspect just calling translate toolkit’s html2po and po2html on the generated HTML would make a usable first prototype. Doing it on some intermediate format, or even on the sources themselves, would probably make more sense though, especially by not leaking HTML markup to the translator (comes up in local emphasis and verbatim quotes and such).

Non-problems

It would seem that the translator will lack context, but:

  • The translation units are generated in document order and all the editors respect the order of units, so the translator will still read the text in the document order.

  • The context is not that much needed in practice. Usually all the translator needs is a glossary of technical terms, so he translates them consistently; and the PO and XLIFF editors already have support for that.

  • Professional translators often split even down to sentences for more reuse. Splitting into paragraphs is rather conservative. See the FAQ in the po4a manpage.

Drawbacks

  • A simple implementation would not be done in Rust, but either in Python (if using translate toolkit) or Perl (if using po4a).

Technical notes

When I tried it on HTML some time ago, I had better results with translate toolkit’s html2po and po2html than with po4a. I think translate toolkit would also be easier to integrate, since it is just simple scripts to extract and replace the strings, while po4a handles finding files and checking for changes and such, which requires extra configuration, and rustdoc will probably want to take care of that anyway.


#5

There are more drawbacks:

  • Need for manual action to rebuild the localised documentation
  • Need for translation specific tooling to regenerate documentation

With the system I suggest, the documentation for locales (with warnings on outdated parts) is rebuilt automatically when you build the master documentation, without requiring any extra action from the developer. I’m OK with requiring tools for translators, but if maintaining the documentation needs extra care from the developers, they may forget (or procrastinate) to update it, so the translated documentation goes out of date without notice.

I agree I did not learn enough about existing translation tools before writing this pre-RFC. For translators, the ability to use existing translation tools helps. I read the documentation about “.po” and it seems that supporting translation tools would not change much of what I am proposing. You need a tool to generate the files from data extracted from the master, and since the purpose of rustdoc is extracting the documentation from the source, it is the best tool to do that. It would just need to generate “.po” files instead of the “.loc” format I proposed. The example file in my proposal would turn into:

msgid ""
msgstr ""
"Project-Id-Version: crate_name\n"
"PO-Revision-Date: 2016-03-04 15:13+0200\n"
"Last-Translator: John Doe <john.doe@domain.net>\n"

msgctxt "struct MainStruct"
msgid "The main struct of the library, it is very important"
msgstr "La struct principale de la bibliothèque, Elle est très importante"

msgctxt "impl MainStruct > fn do_something"
msgid "Do something interesting"
msgstr "Fait quelquechose d'intéressant"

And on updated texts, we would get:

#, fuzzy
#| msgid "Old main language text"
msgctxt "impl MainStruct > fn do_something"
msgid "New language text"
msgstr "Translation"
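
A rough sketch of how rustdoc could write such entries. The function names are hypothetical; the keywords are the ones gettext defines (`msgctxt` for the context, `#, fuzzy` for the flag and `#| msgid` for the previous source text), and only the most common escapes are handled.

```rust
/// Escapes a string for use inside a double-quoted PO string.
fn po_escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\t' => out.push_str("\\t"),
            other => out.push(other),
        }
    }
    out
}

/// Formats one PO entry. `context` is the item path; `previous` is the old
/// source text, given only when the entry must be marked fuzzy.
fn po_entry(context: &str, msgid: &str, msgstr: &str, previous: Option<&str>) -> String {
    let mut out = String::new();
    if let Some(old) = previous {
        out.push_str("#, fuzzy\n");
        out.push_str(&format!("#| msgid \"{}\"\n", po_escape(old)));
    }
    out.push_str(&format!("msgctxt \"{}\"\n", po_escape(context)));
    out.push_str(&format!("msgid \"{}\"\n", po_escape(msgid)));
    out.push_str(&format!("msgstr \"{}\"\n", po_escape(msgstr)));
    out
}
```

Using the item path as context keeps two items with identical documentation text as separate entries, at the cost of losing automatic reuse when an item is renamed.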

#6

I agree it should eventually be integrated into rustdoc to automate it for the authors.

How much it implements directly or calls out to translate toolkit is then an implementation detail. Given that I didn’t find a markdown splitter in either and that rustdoc already parses markdown, using the translate toolkit probably won’t save all that much work after all.

What I think should be taken from the existing tools is the idea of splitting into paragraphs or blocks. It can be applied to both long documentation blocks and separate markdown chapters (that answers what to do with markdown files). Every block that gets a block-level markup in the output should be a separate unit, so paragraph, heading, list item and block-quote should all be separate units. They should be put into the po file without the block-level format (so that the tool ensures heading is translated to heading, list item to list item etc.) and whitespace-normalized, so that reformatting that does not affect the output won’t break the translation.
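
A minimal sketch of turning one block into a translation unit, under simplifying assumptions: the block has already been isolated, only a leading `#`, `- `, `* ` or `>` marker is recognised, and all names are hypothetical. A real implementation would rely on the markdown parser rustdoc already uses.

```rust
/// Kind of block-level markup a translation unit came from.
#[derive(Debug, PartialEq)]
enum BlockKind {
    Paragraph,
    Heading,
    ListItem,
    BlockQuote,
}

/// Collapses every run of whitespace (line breaks included) into a single space.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Separates the block-level marker from the text and normalizes the text.
fn to_unit(block: &str) -> (BlockKind, String) {
    let trimmed = block.trim_start();
    let (kind, body) = if trimmed.starts_with('#') {
        (BlockKind::Heading, trimmed.trim_start_matches('#'))
    } else if let Some(rest) = trimmed
        .strip_prefix("- ")
        .or_else(|| trimmed.strip_prefix("* "))
    {
        (BlockKind::ListItem, rest)
    } else if let Some(rest) = trimmed.strip_prefix('>') {
        (BlockKind::BlockQuote, rest)
    } else {
        (BlockKind::Paragraph, trimmed)
    };
    (kind, normalize_whitespace(body))
}
```

Since only the normalized text would go into the po file, re-wrapping a paragraph in the source leaves its unit unchanged, and the block kind is reapplied by the tool when the translated text is written back.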


#7

My pre-proposal deliberately did not cover markdown files, but I agree that splitting into paragraphs seems the way to go.