Internationalization of crate metadata?

I agree entirely that we should support internationalization.

Going back to the question of how: I do not think we should do this in an ad-hoc way, by adding translated fields and files to the original crate's metadata or repository. I think it makes much more sense to integrate with tools that translators already use and support, and in particular those that track the original string along with the translation to detect if the translation is out of date. Then, we can add support in our tools to integrate such translations, such as displaying them on crates.io on request, or in cargo search, or in rustdoc / docs.rs. (This also gives users the option of seeing partial translations containing only the items that are up-to-date, if they want that.)

This will also help ensure consistency in things that need to be translated in a consistent way across packages, such as crates.io categories and probably keywords.

All of this should not involve modifying the original crate. Updating translations should not require updating the crate and bumping the version; translations should be able to cover previously released versions as well.
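As a rough illustration of "tracking the original string along with the translation", here is a minimal Rust sketch. The type and function names (`TranslationEntry`, `Catalog`, `current_translation`) and the item id "description" are hypothetical and not part of any existing tool. The sample strings are only illustrative.

```rust
use std::collections::HashMap;

/// One translated item, stored together with the source text it was made from.
#[derive(Debug, Clone)]
struct TranslationEntry {
    source: String,
    translated: String,
}

/// Translations for one language, keyed by a hypothetical item id
/// such as "description" or "readme".
type Catalog = HashMap<String, TranslationEntry>;

/// An entry is current only if the recorded source still matches the
/// text that the crate ships today.
fn is_up_to_date(entry: &TranslationEntry, current_source: &str) -> bool {
    entry.source == current_source
}

/// Returns the translation for `id` only when it is up to date, so that a
/// caller can show a partial translation and fall back for everything else.
fn current_translation<'a>(
    catalog: &'a Catalog,
    id: &str,
    current_source: &str,
) -> Option<&'a str> {
    catalog
        .get(id)
        .filter(|entry| is_up_to_date(entry, current_source))
        .map(|entry| entry.translated.as_str())
}

fn main() {
    let mut ja = Catalog::new();
    ja.insert(
        "description".to_string(),
        TranslationEntry {
            source: "This is rust".to_string(),
            translated: "これはさびです".to_string(),
        },
    );

    // The source is unchanged, so the translation is returned.
    println!("{:?}", current_translation(&ja, "description", "This is rust"));
    // The source changed upstream, so the stale translation is withheld.
    println!("{:?}", current_translation(&ja, "description", "This is Rust, the language"));
}
```

Because the catalog lives outside the crate, it can be updated at any time and checked against any released version's text.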

A partial list of things that should be translatable:

  • Crate descriptions
  • Crate README files
  • Crate categories (globally on crates.io, not on a crate-by-crate basis)
  • Crate keywords (globally coordinated, though this may still require crate-by-crate translations, due to the possibility of either homographs in the original crate language, or distinct words in the translation target language for nuances the original crate language doesn't have)
  • rustdoc documentation. This could include doctests or code examples, but ideally, there should be some automatic verification that the code is semantically identical and is only translating identifiers, comments, etc.
  • Other documentation for Rust and Cargo (beyond the standard library documentation, which would be covered by rustdoc translation)
  • Contribution documentation such as the rustc dev guide. (However, given that PRs, RFCs, code comments, documentation, tests, commit messages, and similar would still need to be contributed in English, we should find out whether there are people who would benefit from this and want to see it.)
12 Likes

Oh, I like the idea of providing translations externally!

My initial thought was some TOML syntax tweak like description[ja_JP] = "これはさびです", but a tool that allows translating other people's crates by volunteers would make that much more accessible to people who can provide translations, and not burden crate authors with knowing all the languages in the world.

4 Likes

An interesting library to look at here is Sinatra: https://github.com/sinatra/sinatra, which translates its README (and main docs) into many languages.

In general, Ruby and the Ruby community are interesting to look at, since substantial writing (like the Ruby Hacking Guide) is in Japanese, and the English translations of the docs are more prone to lagging. https://ruby-hacking-guide.github.io/

8 Likes

فيما يخص الاقتراح الأول، إضافة رموز اللغات إلى أسماء الملفات، فلقد تم العمل به في كثير من المواقع (سفنكس Sphinx). منتِجوا المواقع الثابتة يستعملونه في مختلف الحالات. (هوڭو Hugo ||جيكل Jekyll). ڭت تيكست و سفنكس يشكلان نظاما معقدا و يتطلب مجهودا لفهمه. إضافة مزيد من اللغات إلى الملف للقراءة (README) قد تسهل هذه العملية.

إذا أخذنا [الملف الأصلي]، فلنقسمه إلى نصفين، [إسم الملف] + [الملحق] ([FILENAME + EXTENSION]). نضيف الملحق بعد النقطة (.) الأخيرة، لنتولى بأسماء تحتوي على عدة نقط. ينتمي [رمز اللغة] إلى واحد من الرموز الاعتيادية (en, fr, ar, ja ،إلخ).

نبدأ بالبحث عن [إسم الملف] . [رمز اللغة] . [الملحق] ([EXTENSION] + [LANGUAGE CODE] + [FILENAME]). و يُستعمل إذا وُجد. إذا لم يوجد، فنستعمل [الاسم الأصلي]

و فيما يخص (للقراءة | README)، فخيارنا الأول هو البحث عن ( للقراءة.[رمز اللغة].README.[LANGUAGE CODE].md | md )، فيمكن لمن يستعملان المنتوج أن تضيفا ( للقراءة.README.fr.md | md.fr ) للتكلف باللغة الفرنسية، أو ( للقراءة.README.ar.md | md.ar) للتكلف باللغة العربية. يمكن تطبيق رمز اللغة بموقع مثل <crates.io>.

يمكن بناء ملف TOML للمكتبة كما يلي:

[package]
name = "hello_world"
description = "hi"
[package.ar]
name = "مرحبا بالعالم"
description = "أهلا"

إذا لم يحدد رمز اللغة. يمكن استعمال الاختيار اللذي لا يحمل رمزا. مواصفة اللغة لا تستعمل أي مفتاح فرعي.

Français

La première suggestion, où l’on applique des codes aux noms de fichiers, est utilisée dans plusieurs endroits (Sphinx: https://sphinx-doc.org/en/master/usage/advanced/intl.html). Les générateurs de sites statiques utilisent beaucoup cette approche (Hugo: https://gohugo.io/content-management/multilingual/ || Jekyll: https://github.com/kurtsson/jekyll-multiple-languages-plugin).

Gettext/Sphinx est un système complexe, et n’est probablement pas pour les débutants. Par contre, des fichiers README multilingues seraient utiles.

Pour [FICHIER ORIGINAL], on brise l’entrée en deux parties, [NOM DE FICHIER] + [EXTENSION]. [EXTENSION] contient uniquement le texte après le dernier ., pour gérer les fichiers avec plusieurs . dans leur nom. [CODE DE LANGUE] est un des codes de langue typiques de ISO 639-1 (en, fr, ar, ja, etc.).

  • On cherche d’abord [NOM DE FICHIER] . [CODE DE LANGUE] . [EXTENSION]. S’il est trouvé, on utilise ce fichier.
  • Sinon, on utilise [FICHIER ORIGINAL].

Pour README.md, on cherche toujours par défaut README.[CODE DE LANGUE].md. Les usagers peuvent ajouter README.fr.md pour supporter le français, ou README.ar.md pour l’arabe. Les codes de langue peuvent être déterminés par des sites comme crates.io.

Pour les fichiers Crate TOML, de façon similaire:

[package]
name = "hello_world"
description = "hi"
[package.fr]
name = "bonjour_le_monde"
description = "salut"

Si la langue n’est pas spécifiée, on utilise la valeur par défaut spécifiée dans la configuration sans code de langue. Aucun code de langue n’est utilisé comme sous-clef dans la spécification.

svenska

Det första förslaget att applicera språkkoder i filnamn används av många (Sphinx: https://www.sphinx-doc.org/en/master/usage/advanced/intl.html). Statiska sidgeneratorer använder också ofta denna metod (Hugo: https://gohugo.io/content-management/multilingual/ || Jekyll: https://github.com/kurtsson/jekyll-multiple-languages-plugin).

Gettext/Sphinx är ett mycket invecklat system och är kanske inte det bästa att börja med, men stöd för flerspråkiga README filer skulle hjälpa.

För [URSPRUNGLIGT FILNAMN] så delas det upp i två bitar, [FILNAMN] + [FILÄNDELSE]. [FILÄNDELSE] är bara texten efter den sista . i filnamnet för att kunna hantera flera . i filnamnet. [SPRÅKKOD] är en av de vanliga språkkoderna (en, fr, ar, ja, etc.).

  • Börja med att leta efter [FILNAMN] . [SPRÅKKOD] . [FILÄNDELSE]. Hittad? Använd filen.
  • Om inte, använd [URSPRUNGLIGT FILNAMN].

För README.md är standardinställningen att leta efter README.[SPRÅKKOD].md. Användare kan lägga till README.fr.md för att stödja franska eller README.ar.md för att stödja arabiska. Språkkod kan ställas in per websida som crates.io.

Med en Crate TOML file, som:

[package]
name = "hello_world"
description = "hi"
[package.sv]
name = "hello_world"
description = "hej"

Om ingen språkkod är specificerad används standardinställningen från den utan språkkod. Ingen språkkod används som undernyckel i specifikationen.

Nederlands

Het eerste voorstel om taalcodes toe te passen in bestandsnamen wordt op veel plaatsen al zodanig gebruikt (Sphinx: https://www.sphinx-doc.org/en/master/usage/advanced/intl.html). Statische website generatoren gebruiken dit ook alom. (Hugo: https://gohugo.io/content-management/multilingual/ || Jekyll: https://github.com/kurtsson/jekyll-multiple-languages-plugin).

Gettext/Sphinx is een erg gecompliceerd systeem en waarschijnlijk een van de lastigste om mee te starten. Het ondersteunen van meertalige READMEs zou echter behulpzaam zijn.

Breek [ORIGINELE BESTANDSNAAM] in twee delen, [BESTANDSNAAM] + [EXTENSIE]. [EXTENSIE] is alleen de tekst die na de laatste punt komt, om met bestandsnamen met meerdere punten overweg te kunnen. [TAALCODE] is een van de typische taalcodes (en, nl, ar, ja, enz.).

  • Begin door te zoeken naar [BESTANDSNAAM] . [TAALCODE] . [EXTENSIE]. Gevonden? Zoja, gebruik deze.
  • Als dat niet lukt, gebruik dan [ORIGINELE BESTANDSNAAM].

Voor README.md zoekt de standaard naar README.[TAALCODE].md. Gebruikers kunnen een README.fr.md toevoegen voor Frans, of README.ar.md voor Arabisch. Taalcodes kunnen worden ingesteld door een website zoals crates.io.

Voor een Crate TOML bestand is het vergelijkbaar:

[package]
name = "hello_world"
description = "hi"
[package.nl]
name = "hallo_wereld"
description = "hoi piepeloi"

Als de taalcode niet opgegeven is wordt de standaardvariant gebruikt zonder taalcode. In de specificatie wordt geen taalcode gebruikt als ondersleutel.

Deutsch

Der erste Vorschlag, Sprachcodes am Ende des Dateinamens zu geben, wird bereits in sehr vielen Werkzeugen verwendet (Sphinx: https://www.sphinx-doc.org/en/master/usage/advanced/intl.html). Auch in Statischer Seitengenerierung ist es stark in Verwendung (Hugo: https://gohugo.io/content-management/multilingual/ || Jekyll: https://github.com/kurtsson/jekyll-multiple-languages-plugin).

Gettext/Sphinx ist ein sehr komplexes System und höchstwahrscheinlich nicht ideal für AnfängerInnen. Mehrsprachige README Unterstützung würde allerdings sehr hilfreich sein.

Die [ORIGINALE DATEI] wird dafür in zwei Teile geteilt, [DATEINAME] + [ERWEITERUNG]. [ERWEITERUNG] ist nur der Text nach dem letzten ., um Dateinamen mit mehreren . zu unterstützen. [SPRACHCODE] ist einer der typischen Sprachcodes (en, fr, ar, ja, etc.).

  • Zuerst wird nach [DATEINAME] . [SPRACHCODE] . [ERWEITERUNG] gesucht. Falls gefunden, wird die Datei verwendet.
  • Falls nicht, wird [ORIGINALE DATEI] verwendet.

Für README.md wird standardmäßig immer nach README.[SPRACHCODE].md gesucht. NutzerInnen können dann README.fr.md hinzufügen, um Französisch zu unterstützen, oder README.ar.md, um Arabisch zu unterstützen. Der Sprachcode wird dann durch eine Webseite wie crates.io gewählt.

Für die Crate TOML Datei ähnlich:

[package]
name = "hello_world"
description = "hi"
[package.de]
name = "hallo_welt"
description = "hallo"

Falls der Sprachcode nicht spezifiziert ist, wird standardmäßig einfach jener ohne Sprachcode verwendet. Sprachcodes werden nicht als Subschlüssel in der Spezifikation verwendet.

English

The first suggestion of applying language codes to file names is used in a lot of places (Sphinx: https://www.sphinx-doc.org/en/master/usage/advanced/intl.html). Static site generators also use this extensively (Hugo: https://gohugo.io/content-management/multilingual/ || Jekyll: https://github.com/kurtsson/jekyll-multiple-languages-plugin).

Gettext/Sphinx is a very involved system and probably not the best place to start. But multilingual README support would be helpful.

For [ORIGINAL FILE], break the name into two pieces, [FILENAME] + [EXTENSION]. [EXTENSION] is only the text after the very last ., so that file names containing several . are handled. [LANGUAGE CODE] is one of the typical language codes (en, fr, ar, ja, etc.).

  • Start by searching for [FILENAME] . [LANGUAGE CODE] . [EXTENSION]. If it is found, use that file.
  • If that fails, use [ORIGINAL FILE].

For README.md, the default is to always look for README.[LANGUAGE CODE].md. Users can add README.fr.md to support French, or README.ar.md to support Arabic. The language code can be passed down by a website like crates.io.
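The lookup described above could be sketched in Rust roughly as follows. The function names (`localized_name`, `resolve`) and the `exists` callback are hypothetical helpers invented for this illustration. A real tool would query the file system or the crate tarball instead.

```rust
/// Builds the localized candidate name [FILENAME].[LANGUAGE CODE].[EXTENSION].
/// The extension is only the text after the very last '.'.
/// Returns None when the name contains no '.' to split on.
fn localized_name(original: &str, lang: &str) -> Option<String> {
    let (stem, ext) = original.rsplit_once('.')?;
    Some(format!("{}.{}.{}", stem, lang, ext))
}

/// Picks the localized file when `exists` reports it, otherwise the original.
fn resolve(original: &str, lang: Option<&str>, exists: impl Fn(&str) -> bool) -> String {
    lang.and_then(|code| localized_name(original, code))
        .filter(|candidate| exists(candidate.as_str()))
        .unwrap_or_else(|| original.to_string())
}

fn main() {
    // Stand-in for the files that a crate actually ships.
    let files = ["README.md", "README.fr.md", "notes.v2.txt"];
    let exists = |name: &str| files.contains(&name);

    // A French README exists, so it is chosen.
    println!("{}", resolve("README.md", Some("fr"), &exists));
    // No Arabic README exists, so the original is used.
    println!("{}", resolve("README.md", Some("ar"), &exists));
    // Only the last '.' separates the extension.
    println!("{:?}", localized_name("notes.v2.txt", "fr"));
}
```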

For a crate's TOML file, something similar can be done in the file itself:

[package]
name = "hello_world"
description = "hi"
[package.es]
name = "hola_mundo"
description = "hola"

If a language code is not specified, just use the default values from the table that has no language code. No language code is used as a sub-key in the specification.
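To make the fallback rule concrete, here is a small Rust sketch that models the proposed tables in memory. It does not parse TOML, and the `[package.<lang>]` tables are only the proposal from this post, not something Cargo supports. The names `Metadata`, `Package`, and `metadata_for` are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    name: String,
    description: String,
}

/// Models `[package]` plus any number of proposed `[package.<lang>]` tables.
struct Package {
    default: Metadata,
    localized: HashMap<String, Metadata>,
}

impl Package {
    /// Uses the table for `lang` when one exists, otherwise the table
    /// that has no language code.
    fn metadata_for(&self, lang: Option<&str>) -> &Metadata {
        lang.and_then(|code| self.localized.get(code))
            .unwrap_or(&self.default)
    }
}

fn main() {
    let mut localized = HashMap::new();
    localized.insert(
        "es".to_string(),
        Metadata { name: "hola_mundo".to_string(), description: "hola".to_string() },
    );
    let package = Package {
        default: Metadata { name: "hello_world".to_string(), description: "hi".to_string() },
        localized,
    };

    // Spanish is present, German and "no language" fall back to the default.
    println!("{}", package.metadata_for(Some("es")).description);
    println!("{}", package.metadata_for(Some("de")).description);
    println!("{}", package.metadata_for(None).description);
}
```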


I think a good first step is supporting [FILENAMEPREFIX] . [COUNTRY CODE] _ [LANGUAGE CODE] . [EXTENSION] as a way to have certain files for a specific country/language. There are ISO standards (1, 2) for the country and language codes to make it easy for everyone to agree. This will work well for READMEs and is easy to scale for most projects.

".po" files (Gettext-style translation) are a much bigger thing. A lot of documentation translators are used to working this way thanks to FSF and Linux Foundation-style projects, which use it frequently for the strings output by various projects. I am unsure if this is the best fit for documentation itself, because documentation is largely prose, so whole passages of text would be replaced rather than just certain strings. This is why I think looking up and using files by Language Code / Country Code would be a far better alternative here, but I'd leave that to better experts. :slight_smile:


A short bit on English Supremacy in Tech: what much of this thread argued for is that we should make it as difficult as possible for anyone to interface with the start of my post. Those arguments reject a dedicated translation feature in order to "keep things closer to English" or "prevent Balkanization", yet Balkanization is exactly what making translation non-standard and difficult would cause, as @Manishearth has explained so beautifully. I had thought this was taken for granted in the Rust community, but it is entirely worth noting:

  • The first "serious" Rust Communities came from Korea and Sweden.
  • One of the most famous posts legitimizing Rust was a Korean's post on the Rust Reddit, not someone working in an English speaking country.
  • Rust's biggest users exist outside of English countries (PingCAP, Embark, Parity, etc).

Forcing people to speak English does not create magical creatures who suddenly absorb English better. It throws out good programmers who are willing to do hard work but don't want to put up with an ecosystem that constantly tells them to Learn English Or Piss Off. The feeling many of you got from staring at the Arabic above and having to find one of the drop-downs so you could understand it is the same feeling many of us have to deal with, in perpetuity, in many programming language communities. What some advocated for -- the equivalent of a "no support for translation dropdowns on this post at all" policy -- is just suppression of every other language with English Only Because There's No Support For Internationalization. It is a crabs-in-a-bucket, race-to-the-bottom style of community management and it absolutely sucks.

We can either write our READMEs in some language and have everyone grafting on ad-hoc translation support (or none at all, because we make it difficult and thus have people just give up on sharing in other written languages or leave the community).
OR we can provide standard, well-defined controls for doing a better job of serving multiple people and increasing the appeal and shareability of not only Rust but also Rust's libraries/crates, with everyone.

I prefer -- and this community should absolutely prefer -- the latter.

15 Likes

Look guys, I don't have a rooster in this fight. I said what I have to say, which is that it's a waste of effort at best IMO. But it won't be my effort that's wasted, so have at it.

I also think that Rust should improve its tooling to support internationalization. But I don't think translations should be part of a crate (i.e. uploaded to crates.io). As others have noted, usually project authors use one language for all documentation (readme, API docs, etc.) and translations are done by sub-communities, not by core maintainers. Those translations tend to get outdated with new releases and gradually get updated post-factum.

So I think we should add a Cargo.toml attribute to indicate the documentation language (e.g. doc_lang, which could default to en-us, since traditionally it's the most popular language) and an attribute that links to the translation files (it could be a path to a GitHub folder, a URL to a tarball, or something else). Translations of API docs would be stored separately as pairs of an original documentation string and its translation into the target language, for example using the po format. rustdoc and sites like docs.rs would be able to download those files and generate documentation for the requested language. If the translation for a given item and language is absent (e.g. if the original docs get updated without a matching translation being added), then doc generation can insert the documentation in the original language, so in the worst-case scenario the user can copy-paste such strings into machine translation tools.
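The per-item fallback described above might look roughly like this in Rust. This is only a sketch: `DocCatalog`, `Rendered`, and `render_doc` are hypothetical names, the catalog is an in-memory stand-in for a po-style file, and the sample strings are illustrative.

```rust
use std::collections::HashMap;

/// Pairs of original doc string -> translation for one target language,
/// similar in spirit to a po catalog.
struct DocCatalog {
    lang: String,
    pairs: HashMap<String, String>,
}

/// The text chosen for one documented item, tagged with its language.
#[derive(Debug, PartialEq)]
struct Rendered<'a> {
    lang: &'a str,
    text: &'a str,
}

/// Uses the translation when the catalog has one for exactly this original
/// string; otherwise falls back to the original text in `doc_lang`.
fn render_doc<'a>(original: &'a str, doc_lang: &'a str, catalog: &'a DocCatalog) -> Rendered<'a> {
    match catalog.pairs.get(original) {
        Some(translated) => Rendered { lang: catalog.lang.as_str(), text: translated.as_str() },
        None => Rendered { lang: doc_lang, text: original },
    }
}

fn main() {
    let mut pairs = HashMap::new();
    pairs.insert("Returns the length.".to_string(), "Renvoie la longueur.".to_string());
    let fr = DocCatalog { lang: "fr".to_string(), pairs };

    // A translation exists for this exact string.
    println!("{:?}", render_doc("Returns the length.", "en-us", &fr));
    // The original was edited after translation, so it is shown untranslated.
    println!("{:?}", render_doc("Returns the length in bytes.", "en-us", &fr));
}
```

Keying the catalog by the original string means an edited doc comment simply stops matching, which gives the "insert the original language" behavior without extra bookkeeping.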

To re-iterate I think that:

  • Translation files should be kept separately from a crate to allow post-release updates.
  • rustdoc should be able to generate and consume translation files. During the generation step it should re-use old translations as much as possible. On top of those files the community will be able to build services for crowdsourced translation of documentation (or we could adapt existing ones).
  • README and other "big" documents should be treated separately from the API docs, but still should not be published together with a crate.
  • We can start by adding the doc_lang attribute. Even if internationalization does not get enough traction in the near future, such an attribute should be useful nevertheless.
2 Likes

How common is it in the Ruby ecosystem to have docs in this many languages?

A key concern with docs in multiple languages is locating the most up-to-date language version that one can read (or can be bothered to feed to translation software).

If the common case (after the monolingual case) ends up being two (or so) languages, the easy way to deal with this problem would be for crates.io to render the README in both languages without trying to hide either language from the reader. The simple way to get to that point right away would be to put both language versions in the same README file with div lang= as @Lonami suggested upthread (with the README author deciding the order of languages).

The README of igo-rs is a pre-existing example of this pattern. One doesn't need to be able to read Japanese to see at a glance that the English text does not cover all the same points as the Japanese text. Hiding the Japanese text from people who according to some software setting "prefer English" would be anti-useful.

1 Like

So to me the plan looks like this:

  • I can add support for displaying README.$lang.ext on https://lib.rs. I'll try making language switching based on subdomains like https://$lang.lib.rs to allow search engines to index all translated readmes. (edit: displaying both the translated README and the README in the primary language together on one page, so that readers have a chance to notice when the translation is out of date).

  • I'll have to add language sniffing to support existing crates without language identifiers and multi-lingual READMEs that include one language after another. Fortunately, there are Rust crates for this!

  • That's a big maybe, but maybe propose an RFC for [package] lang = "$BCP47" for Cargo to define the primary language of crate's description. It won't support mixing multiple languages in the same Cargo.toml, because it's meant to identify the one language the crate author prefers to use, and all other languages will be left for external translation.

  • Design and implement tooling for external translation of crates. It might be something that extracts Cargo.toml strings, and maybe doc comments and paragraphs of the README, and produces an XLIFF file from them for translation. There are lots of XLIFF tools that hopefully will provide translation memory, UIs for contribution, etc.
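As an illustration of the subdomain-based switching in the first point, a site could derive the language from the request host along these lines. This is a sketch under assumptions: `lang_from_host` and `readme_for_host` are hypothetical names, the `.md` extension is assumed, and the label is not validated as a real language tag.

```rust
/// Extracts the language label from a host such as "ja.lib.rs".
/// Returns None for the bare site host or for anything that is not
/// exactly one label in front of `site`.
fn lang_from_host<'a>(host: &'a str, site: &str) -> Option<&'a str> {
    let label = host.strip_suffix(site)?.strip_suffix('.')?;
    if label.is_empty() || label.contains('.') {
        None
    } else {
        Some(label)
    }
}

/// Maps a host to the README file name that would be looked up first.
fn readme_for_host(host: &str, site: &str) -> String {
    match lang_from_host(host, site) {
        Some(lang) => format!("README.{}.md", lang),
        None => "README.md".to_string(),
    }
}

fn main() {
    println!("{}", readme_for_host("ja.lib.rs", "lib.rs"));
    println!("{}", readme_for_host("lib.rs", "lib.rs"));
}
```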

7 Likes

These two constraints are at odds, or will at least need large workflow changes or additional tooling built. docs.rs downloads the source and builds the docs once for each version of a crate, soon after it is published; if the translations are not updated till later then those docs that have been built will be missing the updated translations.

Yes, the docs.rs workflow would have to change. Initially it can be a request-based system in which users would add crate versions to a rebuild queue. It could be useful not only for updating translations, but also for fixing CSS bugs. Eventually if we are to get a specialized documentation translation service, it can be extended with a webhook-like functionality.
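A request-based rebuild queue of the kind suggested here could be sketched as below. This is not how docs.rs is implemented. `CrateVersion`, `RebuildQueue`, and the crate name in the example are hypothetical, and the sketch only shows de-duplication of requests, not who is allowed to make them.

```rust
use std::collections::{HashSet, VecDeque};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CrateVersion {
    name: String,
    version: String,
}

/// FIFO queue of rebuild requests that ignores duplicates still waiting.
#[derive(Default)]
struct RebuildQueue {
    pending: VecDeque<CrateVersion>,
    queued: HashSet<CrateVersion>,
}

impl RebuildQueue {
    /// Adds a request. Returns false if this crate version is already waiting.
    fn request(&mut self, item: CrateVersion) -> bool {
        if !self.queued.insert(item.clone()) {
            return false;
        }
        self.pending.push_back(item);
        true
    }

    /// Takes the oldest request, making that crate version requestable again.
    fn pop(&mut self) -> Option<CrateVersion> {
        let item = self.pending.pop_front()?;
        self.queued.remove(&item);
        Some(item)
    }
}

fn main() {
    let mut queue = RebuildQueue::default();
    let item = CrateVersion { name: "example_crate".to_string(), version: "1.0.0".to_string() };

    println!("first request accepted: {}", queue.request(item.clone()));
    println!("duplicate accepted: {}", queue.request(item));
    println!("next to rebuild: {:?}", queue.pop());
}
```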

Initially it can be a request-based system in which users would add crate versions to a rebuild queue.

This requires authentication of some kind to avoid DoS attacks, and that's something docs.rs can't do right now due to XSS vulnerabilities (see the design discussion "build queue overhaul", Issue #301 in rust-lang/docs.rs on GitHub). I don't have any objection to showing a translated version of the docs if available, but we should separate that from updating docs independent of crates.io releases, which would need much more fundamental changes to docs.rs.

2 Likes

If a README in $lang is incomplete or stale, how does the reader discover that more complete or more up-to-date README is available in another language that they are able to read or are willing to make the effort to machine translate?

If the igo-rs README with its current content was split across en.lib.rs and ja.lib.rs and I was looking at en.lib.rs, how would I discover that igo-rs-ruby has existed at some point?

1 Like

I think the answer here is to assume that the crate has a primary language in README.md (or if present, the package.lang tag). Then e.g. en.lib.rs/igo-rs can show a banner at the top saying roughly "This is a translated version of the readme, and may be out of date. The original readme is available at lib.rs/igo-rs." Since this is a single site-wide string, it should be relatively easy to translate it into multiple languages, and even to have a version that mentions the last time the versions were updated.

Also, nothing keeps the translated documentation itself from noting that it may be incomplete/out-of-date and linking the untranslated documentation as the source of truth. If it doesn't have enough workpower to keep itself updated, that's probably a good idea anyway. (The "sufficiently advanced" translation tool could even do so automatically, only for stuff that's actually incomplete/out-of-date.)

1 Like

Your post is potentially self-contradictory.

"English isn't my mother tongue and I still prefer doing programming in English"

This is true for many people, and it's true for me. This doesn't mean it's true for everybody, the situation is different for each language. Furthermore, often the reasons behind this feeling are because of a lack of good materials and vibrant communities for those languages: precisely the problem that internationalization helps to fix!

[…later…]

Secondly, to drill down a little bit into this: This isn't really balkanization. The people who are enabled by internationalization would otherwise likely never contribute to the ecosystem in the first place.

If (from the first quote) non-native English speakers who currently prefer programming in English "often" do so because of bad internationalization, then by definition, if we had better internationalization, some of those people would (sometimes) program in their native language instead. That would perhaps not eliminate, but would at least reduce, their contributions to the English-speaking Rust community.

I suppose you are arguing that those people are (greatly?) outnumbered by the category described in the second quote, people who need good internationalization to contribute to the ecosystem in the first place.

Having those people join non-English communities is an admirable result, but it's worth noting it's not one that reduces balkanization or helps compensate for the hypothetical no-longer-programming-in-English group. For that, you have to postulate that some of those people will eventually contribute to the English-speaking ecosystem, either directly (because they learn English some day) or indirectly (because someone else translates their code or bug reports or whatnot into English). Which is probably true, but only some.

--

That said, my uneducated guess is that the premise in my first quote is actually a bit too strong. Regardless of proposed improvements, there will still be plenty of reasons for those who can program in English to do so.

  • Even if other languages grow more "vibrant communities", the English community will still have a significant edge for the foreseeable future: certainly in expertise, probably in raw size as well (though I wouldn't be surprised if the Chinese community either is bigger or will be in the future; I have no idea).

  • At least for now, there have been no serious proposals to internationalize Rust code – that is, where language keywords and library function/type names would have multiple versions for different languages. There can only be one name, and for most of the ecosystem that name is in English. Therefore, if you write code you must use some English, so people feel they may as well use English everywhere for consistency's sake.

As for those who don't know English well enough to program in it – well, there is the question of whether or not they dedicate a substantial chunk of their life to learning it. But for the foreseeable future, influencing that one way or the other is way above Rust's pay grade.

1 Like

Correct.

Also, if there were a significant Marathi Rust community (and I was more confident with my spelling :upside_down_face:) I would likely involve myself in both.

And yes, the English community will probably have an edge -- so the people who don't need internationalization to participate will likely participate in the English community whether their specific language community exists or not. That's what I mean when I say "The people who are enabled by internationalization would otherwise likely never contribute to the ecosystem in the first place". It's not a contradiction, I'm specifically talking about the people who are enabled by internationalization. The people who don't need to be enabled will at best involve themselves in both, and at worst involve themselves in the larger community (English), which doesn't contribute to balkanization either.

2 Likes

But why does helping more people have to mean not promoting internationalization? Roughly 20% of the world does speak English, but there's another 80% here that you decided to just conveniently ignore....

Helping people contribute to the Rust ecosystem doesn't increase balkanization any more than not helping them does. For example, there are many Asian Rust communities that I'm familiar with. The status quo seems pretty balkanized to me.

2 Likes

I don’t think that anybody is thinking that this would be a good idea.


However, this made me think of something I haven’t seen addressed in this nice list or anywhere in this thread: what about rustc’s compiler error messages? Is it a reasonable goal to have those translated eventually, too?

It's been discussed in the past. @Manishearth linked to it indirectly in his long post above, in the second-last link. See specifically Translating the compiler, which is a thread that he started in June 2019.

2 Likes

This is something I would love to tackle at some point, but it is a huge undertaking.

1 Like

Translating error messages is a huge undertaking for a few reasons:

  • Error messages are highly technical, and due to CLI formatting restrictions, highly space constrained.
  • Because rustc cares so much about good error messages, our error messages are a moving target to translate. In addition, message structure in the English language informs how we present the error (e.g. ordering of hint spans, etc.), so a fully ideal translation facility would give nearly the full context to be formatted specially for each target language.
  • And even just searchability. If you Google an English error message, you're fairly likely to find info on it (though in English, tbf). If you Google a translated version, you're much less likely to find resources explaining the error (e.g. on SO or URLO).
  • Not to mention the UX questions of choosing what language a CLI tool runs in.

That's not to say it couldn't be done and be useful, but there are extra technical problems (not just social and people power ones) that make it a harder problem. That's why I agree with the general path of translating --explain pages first, as those are both less volatile and have fewer thorny presentation issues.

5 Likes