Here’s my perspective on the Windows side of things:
On Windows, any C program with an int main(int argc, char** argv) entry point already incurs data loss before the program even starts running (the command-line arguments are converted to MBCS by the C runtime), and may therefore be unable to open a path passed in argv as a file.
Given the existence of files that cannot be named in MBCS, any correct way to handle text on Windows MUST NOT involve MBCS. There are three main approaches to handle text on Windows:
The Microsoft-recommended way: build with -DUNICODE, use TCHAR/WCHAR everywhere.
Rust code interfacing with C code written in this manner will have to use OsString / UTF-16 wchar_t.
Use UTF-8 char* internally; convert from/to UTF-16 wchar_t when interfacing with the Windows API or C-runtime. This lets most of the code use char* consistently across platforms.
The current CString works great to interface with such C code.
The legacy way: Use MBCS char*, and accept the data loss on inputs. Blame the users when they put non-MBCS characters in their filenames. Such software also often has internal encoding bugs where it confuses UTF-8 file contents with MBCS. Simple solution: blame the users as soon as they use non-ASCII characters. (this approach works well because most users can’t tell the difference between non-MBCS and non-ASCII)
Rust can kinda interface with such C code – CString works, but the CString<->String conversion “works” only if you use the “blame users on non-ASCII” approach.
I think Rust should add MBCS conversion functions to CString. Rust could also deprecate+rename the existing UTF-8 CString conversion functions, but I don’t think that’s necessary.
Adding parameters to CString conversion methods indicating only one of the encodings involved seems incorrect to me, since perhaps the most important fact about CString encoding (I think) is that CStrings don’t have a programmatically type-system-specified encoding. So, for instance, to_utf8 wouldn’t be correct, because it wouldn’t indicate what assumptions are made about the encoding used in the CString itself.
If standard library functionality for converting to/from arbitrary encodings is provided in the Rust standard library, I’d suggest that the API must clearly indicate (or parameterize) both the source and the destination encodings. I’d also suggest that while CString might be a good choice of types to use for the inputs and outputs of conversion functions, it’s not clear to me that the functions themselves should necessarily be methods of CString. In particular, going from 8-bit encodings to Windows wide encodings and vice-versa should probably involve different data types, one (possibly CString) containing u8s and the other containing u16s.
Sorry, I mean you don’t need to specify both the source and destination encodings, because one of them (the rust string) will always be utf8. In fact no encoding/decoding is necessary if the CString is in utf8.
I agree with your point about not having the methods on CString though. I think CString should be treated like a Vec<u8>, but with a terminating null byte. As such I think it makes sense for the conversion methods to be owned by String (and &str), as they are for byte arrays.
Here are likely explanations why this hasn’t blown up:
As BurntSushi points out, Rust’s CString doesn’t actually assume anything about the encoding (as it mostly deals with [u8] and Vec<u8>). It only makes it easy to convert to Rust strings when you can assume UTF-8.
On Windows, as you note, UTF-8 cannot be set as the encoding used by system APIs that take strings consisting of 8-bit code units, so using CString with Windows system APIs is wrong (except when working around specific bugs in Windows) and, instead, on Windows, system APIs that take UTF-16 strings should be (and are) used instead.
On non-Windows platforms, the reason why Rust code needs to call into libc is to perform system calls. The part of libc that wraps system calls does not perform locale-aware text processing and is, therefore, oblivious to the codeset of strings. (The wide-character IO part of libc should never be used.) The part of libc that is locale-aware performs userland operations, and those operations should be implemented in Rust instead of being delegated to the unreasonably-designed C standard library functions. (C programs shouldn’t call those functions, either, to escape the fundamental misdesign of the C standard library!)
The bogus i18n approach of the C standard library, where the interpretation of strings in APIs depends on the configuration environment, does not apply to all C libraries. As a particularly prominent example, in all GLib-based code (Gnome, etc., including Cairo that you mention) strings are always UTF-8 (except for filenames, which are opaque bytes unless displayed to the user and which are interpreted as UTF-8 for display purposes by default; to override the default display interpretation, you need to set the G_BROKEN_FILENAMES environment variable, whose very name indicates a clueful attitude towards these issues).
On macOS, iOS, Android, Sailfish and OpenBSD the string encoding in system C string text APIs is UTF-8 (though OpenBSD also supports the POSIX C locale that makes the encoding US-ASCII).
Red Hat has defaulted to UTF-8 since 2002, SuSE since 2004 and Debian since 2007 (the most prominent Debian derivative, Ubuntu, defaulted to UTF-8 before Debian itself).
Solaris has at least supported UTF-8 locales since Solaris 8. Non-OpenBSD BSDs at least support UTF-8 locales. (It’s unclear to me how exactly the defaults work.)
As noted above, libc should only be used as a wrapper for system calls.
It is indeed the case that Windows doesn’t allow UTF-8 as the codepage of a process. Concluding that Rust’s CString is wrong is the wrong conclusion though. The right conclusion is that on Windows only the UTF-16 APIs should be used to interact with the system.
Again, libc should only be used as a system call wrapper, and that part of libc doesn’t care about the character encoding.
Here we agree. The conclusion shouldn’t be for Rust to accommodate the bogosity of the C standard library but the conclusion should be to treat the text processing parts of the C standard library as so fundamentally wrong as to not provide any accommodations for using them. (C code shouldn’t use the text processing parts of the C standard library, either.)
Don’t use codepage 500. It’s not the default either for terminal or for “ANSI” mode APIs for any Windows locale. Setting the code page for a process to something weird that’s supported by Microsoft’s converters and that isn’t prohibited like UTF-8 is a self-inflicted problem. Rust doesn’t need to fix it.
I strongly disagree with adding encoding conversion functionality for legacy encodings other than UTF-16 to the standard library. As noted above, UTF-16 is the right way to interface with Windows, and UTF-8 is either the only way, the default way, or the only sensible way to deal with non-Windows systems. APIs that take non-UTF strings should be shunned and avoided. To the extent there exists legacy-encoded data in the world, Rust programs should convert it to UTF-8 immediately after performing byte-oriented input operations, but the standard library should not be bloated with the conversion tables.
The scope of what encodings someone might find a fringe use case for is too vast for the standard library. Doing away with legacy encodings is a feature, and it’s great that Rust has that feature.
It is a good thing if software developers stop allowing fringe legacy configurations (which is what Posixish platforms with non-UTF-8 locale configurations are) to inflict negative externalities on them.
People who configure a non-Windows system with a non-UTF-8 locale this day and age are engaging in anti-social (towards software developers) fringe activity, and I strongly think that they should bear the cost themselves and the Rust community should refuse to complicate the standard library to accommodate such behavior.
(For Windows, use UTF-16 to sidestep the legacy locale-dependent stuff.)
That C11 doesn’t unambiguously make the interpretation of char16_t strings UTF-16 and the interpretation of char32_t strings UTF-32 highlights how badly in the weeds the C standards committee is when it comes to i18n and text processing. There exist, or have existed, systems where signed integers are not two’s complement; but at the time char16_t and char32_t were added, there was no present or foreseeable reasonable interpretation for them other than UTF-16 and UTF-32, respectively. The C standard is thus even less reasonable here than in its refusal to commit to two’s complement. Considering that Rust already doesn’t seek to interoperate with non-two’s-complement C implementations, it shouldn’t be a radical idea that Rust shouldn’t try to interoperate with C implementations that give char16_t and char32_t an interpretation other than UTF-16 and UTF-32.
But the issue is mostly moot, since, again, Rust should only use libc as a system call interface and avoid the userland text processing parts.
wchar_t is such a disaster that it shouldn’t be used for any purpose other than calling into UTF-16 Windows APIs declared to take wchar_t (in which case the Rust type is known to always be u16).
Out of curiosity, was this something like taking a two-byte EUC sequence as a big-endian integer and putting that into wchar_t?
More likely it means that it’s a leftover from the era when Microsoft worked for IBM (and IBM wants its catalog of legacy encodings supported) and nobody thought to explicitly prohibit it as the code page of a Windows process (like UTF-8 is explicitly prohibited).
One might take the view that Windows is very large like the Web is very large and at that scale there is always someone who does every ill-advised thing. Still, I think that Rust should not in any way accommodate someone doing this particular ill-advised thing.
Furthermore, I think that software designers should resist the lure of legacy encodings. They are an attractive nuisance. Legacy encodings fascinate software developers, because there’s so much opportunity to geek out about all manner of quirks and weirdness. But it’s wrong to think that there is value to legacy encodings or that you gotta catch them all. They are a problem that should be dealt with only to the extent needed to make sense of legacy data and no more.
Yeah, if the current UTF-16 facilities in the standard library aren’t convenient enough for interfacing with Windows, introducing something that makes dealing with the fact that NT uses UTF-16 but Rust uses UTF-8 more convenient would be appropriate.
I don’t know of any useful POSIX functions that care about the setlocale encoding - file I/O functions, getaddrinfo and dlsym expect the input to be in the ill-defined “system encoding” (which is not affected by LC_CTYPE).
It seems that it’s mostly Windows A functions that care about the process locale, and they are deprecated (Rust programs should call the W functions instead).
Am I the only person here getting a headache trying to understand the subtleties of the C standard and how they affect Rust? How would a user like myself know, in an approachable and portable way, which C functions depend on setlocale() without reading the C standard?
I think the idea to remove Rust’s dependency on libc to interact with the underlying system just got my support. That would require wrapping the platform specific APIs ourselves in Rust instead of relying on libc to be the portable layer, but at least the semantics will be properly defined and safe without crazy C loopholes.
Non-UTF-8 Unix: If there is a problem, it is nowhere near limited to CString. OsStr is just an arbitrary collection of bytes, but OsStr::to_str and friends assume UTF-8; this is also what you get from env::args, File::read_to_string, and others. On the output side, io::Write::write_fmt assumes UTF-8, so plain old println! is broken in a non-UTF-8 locale. Properly supporting non-UTF-8 systems would require adding conversions in all these places, which I suspect is not going to happen.
If it does, I suppose it would be worth distinguishing “theoretically current-C-encoding-encoded bag of bytes” C strings, as used by some libc functions, and “theoretically UTF-8-encoded bag of bytes”, as used by glib and other libraries. But to be clear, this only matters for display and user input purposes. For all other purposes, you want to preserve the original binary blobs.
(In theory it also matters for hardcoded strings, such as standard path names (e.g. /dev/null). But I don’t think there is any non-negligible use of non-ASCII-superset encodings, such as UTF-16 or EBCDIC, as C locales; so it should be safe to encode in ASCII.)
Non-UTF-8 encodings on Unix-like systems are a kind of safety problem as explained in the OpenBSD slides I linked to upthread. The safety-oriented approach to Unix-like systems these days being almost always UTF-8 is to help push it to “always” without “almost” instead of counteracting that development by facilitating continued unsafe practices.
Additionally, developing a Rust library that replicated exactly each legacy encoding on each Unix-like system would be a very large undertaking (and would bloat the standard library with code that would be virtually unused). The easier way to match the system’s notion for legacy encodings would be to use iconv, but at least the GNU implementation does not appear to be well-optimized for performance and, as such, not at the level I’d expect of a feature exposed by the Rust standard library.
Unix uses supersets of ASCII. EUC (in contrast with ISO-2022) isn’t short for Extended Unix Code by accident.
I would actually consider the use of the String type to be adequate encoding-specification for that very reason! But it only specifies the encoding of the data that’s actually in a String or &str, not the encoding of anything in a CString or &CStr. If I understand your original comment correctly, you proposed replacing to_str with to_utf8 for going from CString to String. But the use of &str for the return type is, as you state, already an indication of the destination encoding. My point is that both the current name (to_str) and your proposed name (to_utf8) fail to specify the source encoding, i.e. the encoding of the CString. (The docs do indicate pretty clearly that the source must be “valid UTF-8”, though, in order for the function to succeed.)
I think the specific methods to_str and to_string_lossy are not really problematic enough to rename, but if I were designing CString from scratch, the name of to_str would be as_str, because “to” seems to indicate that some kind of data-conversion may be taking place, when in fact all that’s actually taking place is a (validated) cast.
For conversions involving UTF-8, I think that absolutely makes sense! E.g., for a hypothetical conversion method from EBCDIC:
I can see reasonable arguments for both as_str and to_str. On the one hand, “as_” implies trivial (usually constant-time) effort to do the conversion, and UTF-8 validation isn’t trivial. On the other hand, “to_” implies the underlying data is either changing structure or being cloned, which is clearly not the case. On the first hand again, “as_” also implies no changes to the data, which typically also implies infallibility (how can you fail if you “aren’t doing anything”?), but the UTF-8 validation involves either possible failure or changing the data by replacing invalid byte sequences with the replacement character.
Agreed. That’s really the only reasonable way to do things, and it’s what Python and Perl do. You need to explicitly specify an encoding for all input, and explicitly encode all output. In Rust, we can use the type system to handle both of those things.
I think it is a bad idea to include legacy encodings (which probably involve large encoding tables) in the standard library. It might be OK for std to have the API for such encodings, with implementations only for UTF-8 and UTF-16, though.
I’m starting to think that maybe the Right Thing is for CStr(ing) to insist on being UTF-8, in the same way that str(ing) does. The only difference between CStr and str would be that CStr is guaranteed nul-terminated and its len() may be O(n); conversely, str is not guaranteed nul-terminated, may contain interior U+0000, and its len() is O(1).
This also clarifies the difference between CStr and OsStr. An OsStr's job is to faithfully round-trip whatever nonsense we got from the operating system, and that means it is in an uncertain encoding. It may even be in more than one encoding (consider the perfectly valid, as far as a Unix kernel is concerned, pathname /Пользователей/暁美 ほむら/זֹהַר.pdf where the first component is encoded in KOI8-R, the second in EUC-JP, and the third in ISO-8859-8). Conversions to printable strings have to be checked. It seems likely that nul-termination will also in practice be part of OsStr's contract, since most of the APIs it would be used with do in fact take nul-terminated strings, but it might be convenient to not make that part of the official API contract; if nothing else, a hypothetical future all-Rust OS would maybe like to equate OsStr with str.
(Is it feasible, I wonder, to make OsStr be UTF-16 on Windows? That would mesh well with a strict policy of using only the W interfaces.)
I agree that, at least for now, it makes sense to relegate legacy encoding support to crates.
While I agree that the wchar.h interfaces in particular should never be used, I hesitate to say that none of the components of libc that may perform locale-aware text processing should be used from Rust. The exception that comes to mind is getaddrinfo with AI_IDN; bypassing libc’s name resolver is a Bad Plan for most programs, and so is reimplementing IDNA, but AI_IDN works from the active locale. (I might propose AI_UTF8 to the glibc people.)
No argument from me there, either.
I don’t remember for sure, but yes, that sounds right. It was a long time ago and it was also very poorly documented.
I may be missing something, but that sounds…very pointless, even wrong, to me. CStr(ing), per the name, appears to be designed to work like the classic C-style string, that is, a series of non-null bytes followed by a null byte. As noted throughout the thread, this has nothing to do with encoding; and, per the above discussion, this seems like a necessary type of data to work with.
Between these two changes, what infrastructure would be left for interfacing with raw (non-wide) char-data of unknown encoding from C and C++ libraries on Windows?
Very minor point, but on unix and redox OsStr(ing) is already implemented as a simple [u8] with no trailing null byte, so I’d agree with keeping the trailing null unique to CStr(ing).
In general, I think it makes way more sense to let CString be “unknown encoding” rather than UTF-8, and OsString be “whatever encoding the OS uses” rather than “the UTF that’s closest to whatever the OS uses”. In particular, I think it should be legal to have CStrings of several different encodings running around in the same program, including arbitrarily weird legacy encodings Rust will never have any dedicated support for, while OsString should be restricted to whatever encoding(s?) the target OS accepts and/or outputs. Which is exactly what I thought those types already meant, so afaik I’m proposing no changes at all (except maybe adding some of the new methods others have suggested).
OsString on Windows is WTF-8. This makes more sense than making it UTF-16 internally: From the Rust perspective, this makes OsString on Windows as similar as feasible to the Unix case (Basic Latin shows up as ASCII bytes) but the representation can round-trip sequences of 16-bit code units that aren’t valid UTF-16, which can be obtained from the W APIs on Windows. Since NTFS paths are sequences of 16-bit code units without a guarantee of UTF-16 validity, and PathBuf is internally an OsString, WTF-8 instead of UTF-8 is important for the same reason why OsString on Unix is Vec<u8> instead of String.
I don’t see why it would be a good idea to leave IDNA to the C library when 1) it’s an LC_CTYPE-sensitive libc API and LC_CTYPE-sensitive libc APIs are bad, and 2) libc implementors’ IDNA version politics may differ from what’s needed for interop (browser compat) in the real world. It seems to me that IDNA-to-Punycode mapping is the kind of userland operation that Rust programs should do in Rust.