Non-UTF-8 encodings on Unix-like systems are a kind of safety problem, as explained in the OpenBSD slides I linked to upthread. Since Unix-like systems these days are almost always UTF-8, the safety-oriented approach is to help push that "almost always" to "always", rather than counteracting the development by facilitating continued unsafe practices.
Additionally, developing a Rust library that replicated exactly each legacy encoding on each Unix-like system would be a very large undertaking (and would bloat the standard library with code that would go virtually unused). The easier way to match the system's notion of legacy encodings would be to use iconv, but at least the GNU implementation does not appear to be well optimized for performance and, as such, is not at the level I'd expect of a feature exposed by the Rust standard library.
Unix uses supersets of ASCII. EUC (in contrast with ISO-2022) isn't short for Extended Unix Code by accident.
I would actually consider the use of the String type to be adequate encoding-specification for that very reason! But it only specifies the encoding of the data that's actually in a String or &str, not the encoding of anything in a CString or &CStr. If I understand your original comment correctly, you proposed replacing to_str with to_utf8 for going from CString to String. But the use of &str for the return type is, as you state, already an indication of the destination encoding. My point is that the current name (to_str) and your proposed name (to_utf8) both fail to specify the source encoding, i.e. the encoding of the CString. (The docs do indicate pretty clearly that the source must be "valid UTF-8" in order for the function to succeed, though.)
I think the specific methods to_str and to_string_lossy are not really problematic enough to rename, but if I were designing CString from scratch, the name of to_str would be as_str, because "to" seems to indicate that some kind of data-conversion may be taking place, when in fact all that's actually taking place is a (validated) cast.
For conversions involving UTF-8, I think that absolutely makes sense! E.g., for a hypothetical conversion method from EBCDIC:
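(A rough sketch only; the `EbcdicString` type, its `to_utf8` method, and the tiny mapping table below are all made up for illustration.)

```rust
// Purely hypothetical: a wrapper owning EBCDIC-encoded bytes. Neither
// `EbcdicString` nor the mapping below exist in std; the point is just that
// the receiver type names the source encoding, the method names the
// destination encoding, and an actual conversion takes place.
struct EbcdicString {
    bytes: Vec<u8>,
}

impl EbcdicString {
    /// Converts from EBCDIC (source, named by the type)
    /// to UTF-8 (destination, named by the method).
    fn to_utf8(&self) -> String {
        self.bytes.iter().map(|&b| ebcdic_byte_to_char(b)).collect()
    }
}

// Hypothetical per-byte mapping covering only a tiny slice of EBCDIC.
fn ebcdic_byte_to_char(b: u8) -> char {
    match b {
        0x40 => ' ',
        0x81..=0x89 => (b'a' + (b - 0x81)) as char, // a–i
        0xC1..=0xC9 => (b'A' + (b - 0xC1)) as char, // A–I
        0xF0..=0xF9 => (b'0' + (b - 0xF0)) as char, // 0–9
        _ => '\u{FFFD}', // replacement character for everything unmapped here
    }
}
```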
I can see reasonable arguments for both as_str and to_str. On the one hand, “as_” implies trivial (usually constant-time) effort to do the conversion, and UTF-8 validation isn’t trivial. On the other hand, “to_” implies the underlying data is either changing structure or being cloned, which is clearly not the case. On the first hand again, “as_” also implies no changes to the data, which typically also implies infallibility (how can you fail if you “aren’t doing anything”?), but the UTF-8 validation involves either possible failure or changing the data by replacing invalid byte sequences with the replacement character.
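For reference, here is how the two existing methods actually behave today (a quick sketch of the documented std behavior):

```rust
use std::ffi::CStr;

fn main() {
    // Valid UTF-8 behind the nul terminator: to_str borrows the same bytes.
    let ok = CStr::from_bytes_with_nul(b"hi\0").unwrap();
    assert_eq!(ok.to_str(), Ok("hi"));

    // Invalid UTF-8: to_str fails, while to_string_lossy allocates and
    // substitutes U+FFFD for the invalid byte sequence.
    let bad = CStr::from_bytes_with_nul(b"hi\xFF\0").unwrap();
    assert!(bad.to_str().is_err());
    assert_eq!(bad.to_string_lossy(), "hi\u{FFFD}");
}
```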
Agreed. That’s really the only reasonable way to do things, and it’s what Python and Perl do. You need to explicitly specify an encoding for all input, and explicitly encode all output. In Rust, we can use the type system to handle both of those things.
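Something like the following is the shape I have in mind (a minimal sketch; the function names are made up, and the input is assumed to already be UTF-8 so that std alone can do the decoding):

```rust
// Keep undecoded bytes and decoded text in different types, so the compiler
// forces an explicit decode at the input boundary and an explicit encode at
// the output boundary.
fn read_input(raw: &[u8]) -> Result<String, std::str::Utf8Error> {
    // Decode explicitly at the boundary; a non-UTF-8 source encoding would
    // need a real decoder here instead.
    std::str::from_utf8(raw).map(|s| s.to_owned())
}

fn write_output(text: &str) -> Vec<u8> {
    // Encode explicitly at the boundary; for UTF-8 output this is just a copy.
    text.as_bytes().to_vec()
}
```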
I think it is a bad idea to include legacy encodings (which probably involve large encoding tables) in the standard library. It might be OK for std to have the API for such encodings, and implementations for UTF-8 and UTF-16, though.
I'm starting to think that maybe the Right Thing is for CStr(ing) to insist on being UTF-8, in the same way that str(ing) does. The only difference between CStr and str would be that CStr is guaranteed nul-terminated and its len() may be O(n); conversely, str is not guaranteed nul-terminated, may contain internal U+0000, and its len() is O(1).
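The invariants in question are visible in the existing constructors today; interior nul is exactly what CString::new rejects. A quick sketch:

```rust
use std::ffi::CString;

fn main() {
    // str/String may contain interior U+0000 and stores its length explicitly.
    let s = "a\0b";
    assert_eq!(s.len(), 3); // O(1): the length is stored, not scanned

    // CString insists on no interior nul, because its length is defined by
    // the terminator; the same data is rejected at construction time.
    assert!(CString::new("a\0b").is_err());
    let c = CString::new("ab").unwrap();
    assert_eq!(c.as_bytes_with_nul(), b"ab\0");
}
```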
This also clarifies the difference between CStr and OsStr. An OsStr's job is to faithfully round-trip whatever nonsense we got from the operating system, and that means it is in an uncertain encoding. It may even be in more than one encoding (consider the perfectly valid, as far as a Unix kernel is concerned, pathname /Пользователей/暁美 ほむら/זֹהַר.pdf where the first component is encoded in KOI8-R, the second in EUC-JP, and the third in ISO-8859-8). Conversions to printable strings have to be checked. It seems likely that nul-termination will also in practice be part of OsStr's contract, since most of the APIs it would be used with do in fact take nul-terminated strings, but it might be convenient to not make that part of the official API contract; if nothing else, a hypothetical future all-Rust OS would maybe like to equate OsStr with str.
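For example, on Unix (a quick sketch using the std unix extension traits; the bytes below are just arbitrary non-UTF-8 stand-ins for a legacy-encoded path component):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    // Arbitrary non-UTF-8 bytes, e.g. from a legacy-encoded filename.
    let raw = [0xD0u8, 0xCF, 0xCC, 0xD8];
    let os = OsStr::from_bytes(&raw);

    // The original bytes round-trip faithfully...
    assert_eq!(os.as_bytes(), &raw);
    // ...but conversion to a printable &str is checked, and fails here.
    assert!(os.to_str().is_none());
}

#[cfg(not(unix))]
fn main() {}
```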
(Is it feasible, I wonder, to make OsStr be UTF-16 on Windows? That would mesh well with a strict policy of using only the W interfaces.)
I agree that, at least for now, it makes sense to relegate legacy encoding support to crates.
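For example, the encoding_rs crate on crates.io (not std) already handles legacy encodings in userland; a sketch, assuming it as a dependency:

```rust
// [dependencies]
// encoding_rs = "0.8"
use encoding_rs::SHIFT_JIS;

fn main() {
    // Shift_JIS bytes for "日本".
    let bytes = [0x93u8, 0xFA, 0x96, 0x7B];
    let (text, _encoding_used, had_errors) = SHIFT_JIS.decode(&bytes);
    assert!(!had_errors);
    assert_eq!(text, "日本");
}
```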
While I agree that the wchar.h interfaces in particular should never be used, I hesitate to say that none of the components of libc that may perform locale-aware text processing should be used from Rust. The exception that comes to mind is getaddrinfo with AI_IDN; bypassing libc's name resolver is a Bad Plan for most programs, and so is reimplementing IDNA, but AI_IDN works from the active locale. (I might propose AI_UTF8 to the glibc people.)
No argument from me there, either.
I don't remember for sure, but yes, that sounds right. It was a long time ago and it was also very poorly documented.
I may be missing something, but that sounds...very pointless, even wrong, to me. CStr(ing), per the name, appears to be designed to work like the classic C-style string, that is, a series of non-null bytes followed by a null byte. As noted throughout the thread, this has nothing to do with encoding; and, per the above discussion, this seems like a necessary type of data to work with.
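That is, the type's job is the nul-termination invariant, not any particular encoding; a quick sketch (the bytes below are arbitrary non-UTF-8 stand-ins):

```rust
use std::ffi::CString;
use std::os::raw::c_char;

fn main() {
    // Arbitrary non-UTF-8 bytes, e.g. a string in some legacy encoding:
    // CString only enforces "no interior nul" and says nothing about encoding.
    let raw: Vec<u8> = vec![0xB0, 0xA1, 0xC4, 0xDA];
    let c = CString::new(raw).expect("no interior nul");

    // This is the shape C APIs expect: a pointer to nul-terminated bytes.
    let p: *const c_char = c.as_ptr();
    assert!(!p.is_null());

    // The bytes round-trip untouched; no UTF-8 validation has happened.
    assert_eq!(c.as_bytes(), &[0xB0, 0xA1, 0xC4, 0xDA]);
}
```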
Between these two changes, what infrastructure would be left for interfacing with raw (non-wide) char-data of unknown encoding from C and C++ libraries on Windows?
Very minor point, but on Unix and Redox OsStr(ing) is already implemented as a simple [u8] with no trailing nul byte, so I’d agree with keeping the trailing nul unique to CStr(ing).
In general, I think it makes way more sense to let CString be “unknown encoding” rather than UTF-8, and OsString be “whatever encoding the OS uses” rather than “the UTF that’s closest to whatever the OS uses”. In particular, I think it should be legal to have CStrings of several different encodings running around in the same program, including arbitrarily weird legacy encodings Rust will never have any dedicated support for, while OsString should be restricted to whatever encoding(s?) the target OS accepts and/or outputs. Which is exactly what I thought those types already meant, so afaik I’m proposing no changes at all (except maybe adding some of the new methods others have suggested).
OsString on Windows is WTF-8. This makes more sense than making it UTF-16 internally: from the Rust perspective, it makes OsString on Windows as similar as feasible to the Unix case (Basic Latin shows up as ASCII bytes), but the representation can round-trip sequences of 16-bit code units that aren't valid UTF-16, which can be obtained from the W APIs on Windows. Since NTFS paths are 16-bit code units without a guarantee of UTF-16 validity and PathBuf is internally an OsString, WTF-8 instead of UTF-8 is important for the same reason that OsString on Unix is Vec<u8> instead of String.
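A quick Windows-only sketch of that round-trip property (the lone surrogate is just an illustrative invalid-UTF-16 input):

```rust
#[cfg(windows)]
fn main() {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    // 'a', an unpaired surrogate, 'b': not valid UTF-16.
    let units: Vec<u16> = vec![0x0061, 0xD800, 0x0062];
    let os = OsString::from_wide(&units);

    // The 16-bit code units come back out exactly as they went in...
    let round_trip: Vec<u16> = os.encode_wide().collect();
    assert_eq!(round_trip, units);

    // ...but the result is not valid Unicode, so checked conversion fails.
    assert!(os.to_str().is_none());
}

#[cfg(not(windows))]
fn main() {}
```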
I don't see why it would be a good idea to leave IDNA to the C library when 1) it's an LC_CTYPE-sensitive libc API, and LC_CTYPE-sensitive libc APIs are bad, and 2) libc implementors' IDNA version politics may differ from what's needed for interop (browser compat) in the real world. It seems to me that IDNA-to-Punycode mapping is the kind of userland operation that Rust programs should do in Rust.
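For example, the idna crate from the rust-url project already does this mapping in userland; a sketch, assuming it as a dependency (version number illustrative):

```rust
// [dependencies]
// idna = "0.5"
fn main() {
    // Unicode domain in, Punycode/ASCII domain out, independent of LC_CTYPE.
    let ascii = idna::domain_to_ascii("bücher.example").expect("valid IDN");
    assert_eq!(ascii, "xn--bcher-kva.example");
}
```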