[pre-RFC] Deprecate and replace CStr/CString

Here are the likely explanations for why this hasn't blown up:

  • As BurntSushi points out, Rust's CString doesn't actually assume anything about the encoding (as it mostly deals with [u8] and Vec<u8>). It only makes it easy to convert to Rust strings when you can assume UTF-8. (See the sketch after this list.)

  • On Windows, as you note, UTF-8 cannot be set as the encoding used by the system APIs that take strings consisting of 8-bit code units, so using CString with Windows system APIs is wrong (except when working around specific bugs in Windows); on Windows, the system APIs that take UTF-16 strings should be (and are) used instead.

  • On non-Windows platforms, the reason why Rust code needs to call into libc is to perform system calls. The part of libc that wraps system calls does not perform locale-aware text processing and is, therefore, oblivious to the codeset of strings. (The wide-character IO part of libc should never be used.) The part of libc that is locale-aware performs userland operations, and those operations should be implemented in Rust instead of being delegated to the unreasonably-designed C standard library functions. (C programs shouldn't call those functions, either, to escape the fundamental misdesign of the C standard library!)

  • The bogus i18n approach of the C standard library, where the interpretation of strings in APIs depends on the configuration environment, does not apply to all C libraries. As a particularly prominent example, in all GLib-based code (GNOME, etc., including Cairo, which you mention) strings are always UTF-8. The exception is filenames, which are opaque bytes unless displayed to the user and which are interpreted as UTF-8 for display purposes by default; to override the default display interpretation, you need to set the G_BROKEN_FILENAMES environment variable, whose very name indicates a clueful attitude towards these issues.

  • On macOS, iOS, Android, Sailfish and OpenBSD the string encoding in system C string text APIs is UTF-8 (though OpenBSD also supports the POSIX C locale that makes the encoding US-ASCII).

  • Red Hat has defaulted to UTF-8 since 2002, SuSE since 2004 and Debian since 2007 (the most prominent Debian derivative, Ubuntu, defaulted to UTF-8 before Debian itself).

  • Solaris has supported UTF-8 locales since at least Solaris 8. The non-OpenBSD BSDs at least support UTF-8 locales. (It's unclear to me how exactly the defaults work.)
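To make the first bullet concrete, here is a minimal sketch (the byte string is invented for illustration) of how CString stays encoding-agnostic and where the UTF-8 assumption actually enters:

```rust
use std::ffi::CString;

fn main() {
    // CString only cares that the bytes contain no interior NUL;
    // it attaches no encoding to them. 0xE9 is 'é' in windows-1252
    // but is not valid UTF-8 on its own.
    let bytes: Vec<u8> = vec![b'c', b'a', b'f', 0xE9];
    let c_string = CString::new(bytes).expect("no interior NUL");

    // The UTF-8 assumption appears only when converting to &str,
    // and it is checked: here the conversion fails.
    assert!(c_string.to_str().is_err());

    // to_string_lossy() makes the assumption explicit by replacing
    // non-UTF-8 bytes with U+FFFD.
    assert_eq!(c_string.to_string_lossy(), "caf\u{FFFD}");
}
```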

As noted above, libc should only be used as a wrapper for system calls.

It is indeed the case that Windows doesn't allow UTF-8 as the code page of a process. Concluding that Rust's CString is wrong is the wrong conclusion, though. The right conclusion is that, on Windows, only the UTF-16 APIs should be used to interact with the system.
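To illustrate (not prescribe) what that looks like in practice, here is a minimal Windows-only sketch that re-encodes a &str as a NUL-terminated UTF-16 buffer and hands it to a *W API. MessageBoxW is used purely as a familiar example and is bound by hand to keep the sketch dependency-free:

```rust
#[cfg(windows)]
mod win {
    #[link(name = "user32")]
    extern "system" {
        // Hand-written binding to the real user32 API.
        pub fn MessageBoxW(
            hwnd: *mut core::ffi::c_void,
            text: *const u16,
            caption: *const u16,
            utype: u32,
        ) -> i32;
    }

    /// Re-encode a Rust &str as a NUL-terminated UTF-16 buffer.
    pub fn to_wide(s: &str) -> Vec<u16> {
        s.encode_utf16().chain(std::iter::once(0)).collect()
    }
}

#[cfg(windows)]
fn main() {
    let text = win::to_wide("Hello from UTF-16");
    let caption = win::to_wide("Rust");
    // 0 == MB_OK
    unsafe {
        win::MessageBoxW(std::ptr::null_mut(), text.as_ptr(), caption.as_ptr(), 0);
    }
}

#[cfg(not(windows))]
fn main() {}
```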

Again, libc should only be used as a system call wrapper, and that part of libc doesn't care about the character encoding.
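A minimal Unix-only sketch of that obliviousness, assuming the libc crate as a dependency (the function name and buffer size are invented for illustration): the path bytes travel to the kernel untouched, whatever their encoding:

```rust
use std::ffi::CStr;

// libc here is nothing more than a typed surface over the
// open/read/close system calls; no encoding is consulted anywhere.
#[cfg(unix)]
fn read_some(path: &CStr) -> std::io::Result<Vec<u8>> {
    let fd = unsafe { libc::open(path.as_ptr(), libc::O_RDONLY) };
    if fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    let mut buf = vec![0u8; 4096];
    let n = unsafe { libc::read(fd, buf.as_mut_ptr().cast(), buf.len()) };
    unsafe { libc::close(fd) };
    if n < 0 {
        return Err(std::io::Error::last_os_error());
    }
    buf.truncate(n as usize);
    Ok(buf)
}
```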

Here we agree. The conclusion shouldn't be for Rust to accommodate the bogosity of the C standard library; it should be to treat the text processing parts of the C standard library as so fundamentally wrong that Rust provides no accommodations for using them. (C code shouldn't use the text processing parts of the C standard library, either.)

Don't use code page 500. It's not the default for the terminal or for the "ANSI"-mode APIs in any Windows locale. Setting the code page of a process to something weird that Microsoft's converters happen to support and that isn't explicitly prohibited (the way UTF-8 is) is a self-inflicted problem. Rust doesn't need to fix it.

I strongly disagree with adding encoding conversion functionality for legacy encodings other than UTF-16 to the standard library. As noted above, UTF-16 is the right way to interface with Windows, and UTF-8 is either the only way, the default way, or the only sensible way to deal with non-Windows systems. APIs that take non-UTF strings should be shunned and avoided. To the extent that legacy-encoded data exists in the world, Rust programs should convert it to UTF-8 immediately after performing byte-oriented input operations, but the standard library should not be bloated with the conversion tables.
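As a sketch of the "convert immediately after input" policy, with the tables living outside the standard library: this uses the encoding_rs crate (mentioned below) and assumes the caller somehow knows the input is windows-1252:

```rust
use std::io::Read;

// Decode legacy-encoded input to UTF-8 at the I/O boundary, so that
// everything past this point is ordinary Rust String/&str.
fn read_legacy_to_utf8(mut input: impl Read) -> std::io::Result<String> {
    let mut bytes = Vec::new();
    input.read_to_end(&mut bytes)?;
    // decode() never fails: malformed sequences, where the encoding
    // has any, are replaced with U+FFFD, and `had_errors` reports it.
    let (text, _encoding_used, had_errors) = encoding_rs::WINDOWS_1252.decode(&bytes);
    if had_errors {
        eprintln!("warning: input contained bytes invalid in windows-1252");
    }
    Ok(text.into_owned())
}
```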

The scope of what encodings someone might find a fringe use case for is too vast for the standard library. Doing away with legacy encodings is a feature, and it's great that Rust has that feature.

It is a good thing if software developers stop allowing fringe legacy configurations (which is what Posixish platforms with non-UTF-8 locale configurations are) to inflict negative externalities on them.

People who configure a non-Windows system with a non-UTF-8 locale in this day and age are engaging in anti-social (towards software developers) fringe activity, and I strongly think that they should bear the cost themselves and that the Rust community should refuse to complicate the standard library to accommodate such behavior.

(For Windows, use UTF-16 to sidestep the legacy locale-dependent stuff.)

That C11 doesn't unambiguously make the interpretation of char16_t strings UTF-16 and the interpretation of char32_t strings UTF-32 highlights how badly in the weeds the C standards committee is when it comes to i18n and text processing. At least systems where signed integers are not two's complement exist or have existed; by the time char16_t and char32_t were added, there was no present or foreseeable reasonable interpretation for them other than UTF-16 and UTF-32, respectively, so declining to commit there makes even less sense than declining to commit to two's complement. Considering that Rust already doesn't seek to interoperate with non-two's-complement C implementations, it shouldn't be a radical idea that Rust shouldn't try to interoperate with C implementations that give char16_t and char32_t an interpretation other than UTF-16 and UTF-32.
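In FFI terms, committing to the UTF-16 interpretation is what the obvious Rust mapping already implies: char16_t becomes u16, and the standard library decodes UTF-16 from u16 units. A minimal sketch, with the C side left hypothetical:

```rust
/// Sketch: interpret a NUL-terminated char16_t string from C as UTF-16.
/// On the Rust side, char16_t is just u16. The C counterpart here is
/// hypothetical; its declaration would be along the lines of
/// `const char16_t *units`.
unsafe fn char16_str_to_string(mut p: *const u16) -> String {
    let mut units = Vec::new();
    unsafe {
        while *p != 0 {
            units.push(*p);
            p = p.add(1);
        }
    }
    // Lossy UTF-16 decoding: unpaired surrogates become U+FFFD.
    String::from_utf16_lossy(&units)
}
```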

But the issue is mostly moot, since, again, Rust should only use libc as a system call interface and avoid the userland text processing parts.

wchar_t is such a disaster that it shouldn't be used for any purpose other than calling into UTF-16 Windows APIs declared to take wchar_t (in which case the Rust type is known to always be u16).

Out of curiosity, was this something like taking a two-byte EUC sequence as a big-endian integer and putting that into wchar_t?

More likely it means that it's a leftover from the era when Microsoft worked for IBM (and IBM wants its catalog of legacy encodings supported) and nobody thought to explicitly prohibit it as the code page of a Windows process (like UTF-8 is explicitly prohibited).

One might take the view that Windows is very large like the Web is very large and at that scale there is always someone who does every ill-advised thing. Still, I think that Rust should not in any way accommodate someone doing this particular ill-advised thing.

Furthermore, I think that software designers should resist the lure of legacy encodings. They are an attractive nuisance. Legacy encodings fascinate software developers, because there's so much opportunity to geek out about all manner of quirks and weirdness. But it's wrong to think that there is value to legacy encodings or that you gotta catch them all. They are a problem that should be dealt with only to the extent needed to make sense of legacy data and no more.

(Note how CONTRIBUTING.md for encoding_rs sets clear expectations of scope.)

Yeah, if the current UTF-16 facilities in the standard library aren't convenient enough for interfacing with Windows, it would be appropriate to introduce something that makes it more convenient to deal with the fact that NT uses UTF-16 while Rust uses UTF-8.
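For reference, the existing facilities are encode_wide and from_wide on the Windows-specific OsStr/OsString extension traits; a minimal sketch of the round-trip (this path can even carry unpaired surrogates that a String could not hold):

```rust
// Windows-only sketch of the existing std facilities.
#[cfg(windows)]
fn demo() {
    use std::ffi::{OsStr, OsString};
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    // &OsStr -> potentially ill-formed UTF-16 (no NUL terminator added).
    let wide: Vec<u16> = OsStr::new("C:\\temp").encode_wide().collect();

    // ...pass `wide` (plus a 0 terminator) to a *W API here...

    // Vec<u16> from a *W API -> OsString, without requiring the
    // units to form valid UTF-16.
    let back: OsString = OsString::from_wide(&wide);
    assert_eq!(back, OsString::from("C:\\temp"));
}
```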
